Flux 2 Klein pure C inference

2026-01-18 18:01 · github.com

Flux 2 image generation model pure C inference.

This program generates images from text prompts (and optionally from other images) using the FLUX.2-klein-4B model from Black Forest Labs. It can be used as a library as well, and is implemented entirely in C, with zero external dependencies beyond the C standard library. MPS and BLAS acceleration are optional but recommended.

I (the human here, Salvatore) wanted to test code generation with a more ambitious task over the weekend. This is the result: it is my first open source project where I wrote zero lines of code. I believe that inference systems not built on the Python stack (which I do not appreciate) are a way to free the usage of open models and make AI more accessible. There is already a project that runs inference for diffusion models in C/C++, supports multiple models, and is based on GGML. I wanted to see if, with the assistance of modern AI, I could reproduce this work in a more concise way, from scratch, in a weekend. It looks like it is possible.

This code base was written with Claude Code, using the Claude Max plan (the smaller one, at ~80 euros per month). I almost reached the limits, but the plan was definitely sufficient for such a large task, which was surprising. To simplify using this software, no quantization is applied, nor do you need to convert the model: it runs directly from the safetensors weights, using floats.
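
For reference, the safetensors container is simple enough to read without any library: an 8-byte little-endian length, a JSON index mapping tensor names to dtype, shape, and byte offsets, and then the raw tensor data. A minimal sketch of peeking at a header follows; the file name is just an example, not necessarily one shipped in the model directory.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    /* Example path only; point this at any .safetensors file. */
    const char *path = argc > 1 ? argv[1] : "flux-klein-model/vae.safetensors";
    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); return 1; }

    /* First 8 bytes: little-endian uint64 length of the JSON header. */
    unsigned char len_bytes[8];
    if (fread(len_bytes, 1, 8, f) != 8) { fclose(f); return 1; }
    uint64_t header_len = 0;
    for (int i = 7; i >= 0; i--) header_len = (header_len << 8) | len_bytes[i];

    /* The JSON maps tensor names to {"dtype","shape","data_offsets"};
     * offsets are relative to the first byte after the header. */
    char *json = malloc(header_len + 1);
    if (!json || fread(json, 1, header_len, f) != header_len) { fclose(f); return 1; }
    json[header_len] = '\0';

    printf("header is %llu bytes, starts with:\n%.200s\n",
           (unsigned long long)header_len, json);

    free(json);
    fclose(f);
    return 0;
}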

Even if the code was generated using AI, my help in steering towards the right design, implementation choices, and correctness has been vital during the development. I learned quite a few things about working with non-trivial projects and AI.

# Build (choose your backend)
make mps            # Apple Silicon (fastest)
# or: make blas     # Intel Mac / Linux with OpenBLAS
# or: make generic  # Pure C, no dependencies

# Download the model (~16GB)
pip install huggingface_hub
python download_model.py

# Generate an image
./flux -d flux-klein-model -p "A woman wearing sunglasses" -o output.png

That's it. No Python runtime, no PyTorch, no CUDA toolkit required at inference time.

Woman with sunglasses

Generated with: ./flux -d flux-klein-model -p "A picture of a woman in 1960 America. Sunglasses. ASA 400 film. Black and White." -W 250 -H 250 -o /tmp/woman.png, and later processed with image to image generation via ./flux -d flux-klein-model -i /tmp/woman.png -o /tmp/woman2.png -p "oil painting of woman with sunglasses" -v -H 256 -W 256

  • Zero dependencies: Pure C implementation, works standalone. BLAS optional for ~30x speedup (Apple Accelerate on macOS, OpenBLAS on Linux)
  • Metal GPU acceleration: Automatic on Apple Silicon Macs
  • Text-to-image: Generate images from text prompts
  • Image-to-image: Transform existing images guided by prompts
  • Integrated text encoder: Qwen3-4B encoder built-in, no external embedding computation needed
  • Memory efficient: Automatic encoder release after encoding (~8GB freed)
./flux -d flux-klein-model -p "A fluffy orange cat sitting on a windowsill" -o cat.png

Transform an existing image based on a prompt:

./flux -d flux-klein-model -p "oil painting style" -i photo.png -o painting.png -t 0.7

The -t (strength) parameter controls how much the image changes:

  • 0.0 = no change (output equals input)
  • 1.0 = full generation (input only provides composition hint)
  • 0.7 = good balance for style transfer
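
Conceptually (this is an illustration of how strength is commonly implemented in flow-matching img2img pipelines, not necessarily this library's internal code), the encoded input is blended with noise at t = strength, and only the remaining fraction of the sampling steps is run:

/* Illustrative sketch, not the library's internals. */
void init_img2img_latent(float *latent, const float *image_latent,
                         const float *noise, int n, float strength) {
    /* Rectified-flow interpolation between data (t = 0) and noise (t = 1). */
    for (int i = 0; i < n; i++)
        latent[i] = (1.0f - strength) * image_latent[i] + strength * noise[i];
}

int img2img_steps(int num_steps, float strength) {
    if (strength <= 0.0f) return 0;                 /* 0.0 = no change */
    int n = (int)(num_steps * strength + 0.5f);     /* e.g. 4 steps, 0.7 -> 3 */
    return n < 1 ? 1 : n;
}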

Required:

-d, --dir PATH        Path to model directory
-p, --prompt TEXT     Text prompt for generation
-o, --output PATH     Output image path (.png or .ppm)

Generation options:

-W, --width N         Output width in pixels (default: 256)
-H, --height N        Output height in pixels (default: 256)
-s, --steps N         Sampling steps (default: 4)
-S, --seed N          Random seed for reproducibility

Image-to-image options:

-i, --input PATH      Input image for img2img
-t, --strength N      How much to change the image, 0.0-1.0 (default: 0.75)

Output options:

-q, --quiet           Silent mode, no output
-v, --verbose         Show detailed config and timing info

Other options:

-e, --embeddings PATH Load pre-computed text embeddings (advanced)
-h, --help            Show help

The seed is always printed to stderr, even when random:

$ ./flux -d flux-klein-model -p "a landscape" -o out.png
Seed: 1705612345
out.png

To reproduce the same image, use the printed seed:

$ ./flux -d flux-klein-model -p "a landscape" -o out.png -S 1705612345

Choose a backend when building:

make # Show available backends
make generic # Pure C, no dependencies (slow)
make blas # BLAS acceleration (~30x faster)
make mps # Apple Silicon Metal GPU (fastest, macOS only)

Recommended:

  • macOS Apple Silicon: make mps
  • macOS Intel: make blas
  • Linux with OpenBLAS: make blas
  • Linux without OpenBLAS: make generic

For make blas on Linux, install OpenBLAS first:

# Ubuntu/Debian
sudo apt install libopenblas-dev

# Fedora
sudo dnf install openblas-devel

Other targets:

make clean # Clean build artifacts
make info # Show available backends for this platform
make test # Run reference image test

The model weights are downloaded from HuggingFace:

pip install huggingface_hub
python download_model.py

This downloads approximately 16GB to ./flux-klein-model:

  • VAE (~300MB)
  • Transformer (~4GB)
  • Qwen3-4B Text Encoder (~8GB)
  • Tokenizer

FLUX.2-klein-4B is a rectified flow transformer optimized for fast inference:

  • Transformer: 5 double blocks + 20 single blocks, 3072 hidden dim, 24 attention heads
  • VAE: AutoencoderKL, 128 latent channels, 8x spatial compression
  • Text Encoder: Qwen3-4B, 36 layers, 2560 hidden dim

Inference steps: This is a distilled model that produces good results with exactly 4 sampling steps.
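
For context, a distilled rectified-flow sampler is conceptually just a few Euler steps of an ODE from noise (t = 1) to the clean latent (t = 0). The sketch below is illustrative: it uses a uniform timestep schedule, and predict_velocity() is a hypothetical stand-in for the transformer forward pass; the real scheduler may shift the timesteps.

extern void predict_velocity(const float *latent, float t, float *v, int n); /* hypothetical */

void sample(float *latent, float *v, int n, int num_steps /* 4 for klein */) {
    /* latent starts as Gaussian noise at t = 1 */
    for (int s = 0; s < num_steps; s++) {
        float t      = 1.0f - (float)s / num_steps;
        float t_next = 1.0f - (float)(s + 1) / num_steps;
        float dt     = t_next - t;             /* negative: moving toward t = 0 */

        predict_velocity(latent, t, v, n);     /* model predicts dx/dt */
        for (int i = 0; i < n; i++)
            latent[i] += dt * v[i];            /* Euler update */
    }
}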

Approximate memory usage by phase:

  • Text encoding: ~8GB (encoder weights)
  • Diffusion: ~8GB (transformer ~4GB + VAE ~300MB + activations)
  • Peak: ~16GB (if encoder not released)

The text encoder is automatically released after encoding, reducing peak memory during diffusion. If you generate multiple images with different prompts, the encoder reloads automatically.
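
When using the library directly, you can also drop the encoder weights at a point of your choosing with flux_release_text_encoder() from the API listed further down; the call is documented as optional, so this is just a minimal sketch:

flux_ctx *ctx = flux_load_dir("flux-klein-model");
flux_params params = FLUX_PARAMS_DEFAULT;

flux_image *img = flux_generate(ctx, "a lighthouse at dusk", &params);
flux_image_save(img, "lighthouse.png");
flux_image_free(img);

/* Optional: explicitly drop the ~8GB of encoder weights now; they are
   reloaded automatically the next time a new prompt must be encoded. */
flux_release_text_encoder(ctx);

flux_free(ctx);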

Maximum resolution: 1024x1024 pixels. Higher resolutions require prohibitive memory for the attention mechanisms.

Minimum resolution: 64x64 pixels.

Dimensions should be multiples of 16 (the VAE downsampling factor).
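
If you drive the library programmatically, a tiny helper (hypothetical, not part of the library) can clamp a requested size into the supported range and round it to a multiple of 16:

#include <stdio.h>

/* Hypothetical helper: clamp to 64..1024 and round down to a multiple of 16. */
static int fix_dimension(int px) {
    if (px < 64)   px = 64;
    if (px > 1024) px = 1024;
    return (px / 16) * 16;
}

int main(void) {
    printf("%d %d %d\n", fix_dimension(500), fix_dimension(40), fix_dimension(2000));
    /* prints: 496 64 1024 */
    return 0;
}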

The library can be integrated into your own C/C++ projects. Link against libflux.a and include flux.h.

Here's a complete program that generates an image from a text prompt:

#include "flux.h"
#include <stdio.h>

int main(void) {
    /* Load the model. This loads VAE, transformer, and text encoder. */
    flux_ctx *ctx = flux_load_dir("flux-klein-model");
    if (!ctx) {
        fprintf(stderr, "Failed to load model: %s\n", flux_get_error());
        return 1;
    }

    /* Configure generation parameters. Start with defaults and customize. */
    flux_params params = FLUX_PARAMS_DEFAULT;
    params.width = 512;
    params.height = 512;
    params.seed = 42; /* Use -1 for random seed */

    /* Generate the image. This handles text encoding, diffusion, and VAE decode. */
    flux_image *img = flux_generate(ctx, "A fluffy orange cat in a sunbeam", &params);
    if (!img) {
        fprintf(stderr, "Generation failed: %s\n", flux_get_error());
        flux_free(ctx);
        return 1;
    }

    /* Save to file. Format is determined by extension (.png or .ppm). */
    flux_image_save(img, "cat.png");
    printf("Saved cat.png (%dx%d)\n", img->width, img->height);

    /* Clean up */
    flux_image_free(img);
    flux_free(ctx);
    return 0;
}

Compile with:

gcc -o myapp myapp.c -L. -lflux -lm -framework Accelerate # macOS
gcc -o myapp myapp.c -L. -lflux -lm -lopenblas # Linux

Transform an existing image guided by a text prompt. The strength parameter controls how much the image changes:

#include "flux.h"
#include <stdio.h>

int main(void) {
    flux_ctx *ctx = flux_load_dir("flux-klein-model");
    if (!ctx) return 1;

    /* Load the input image */
    flux_image *photo = flux_image_load("photo.png");
    if (!photo) {
        fprintf(stderr, "Failed to load image\n");
        flux_free(ctx);
        return 1;
    }

    /* Set up parameters. Output size defaults to input size. */
    flux_params params = FLUX_PARAMS_DEFAULT;
    params.strength = 0.7; /* 0.0 = no change, 1.0 = full regeneration */
    params.seed = 123;

    /* Transform the image */
    flux_image *painting = flux_img2img(ctx, "oil painting, impressionist style",
                                        photo, &params);
    flux_image_free(photo); /* Done with input */
    if (!painting) {
        fprintf(stderr, "Transformation failed: %s\n", flux_get_error());
        flux_free(ctx);
        return 1;
    }

    flux_image_save(painting, "painting.png");
    printf("Saved painting.png\n");
    flux_image_free(painting);
    flux_free(ctx);
    return 0;
}

Strength values:

  • 0.3 - Subtle style transfer, preserves most details
  • 0.5 - Moderate transformation
  • 0.7 - Strong transformation, good for style transfer
  • 0.9 - Almost complete regeneration, keeps only composition

When generating multiple images with different seeds but the same prompt, you can avoid reloading the text encoder:

flux_ctx *ctx = flux_load_dir("flux-klein-model");
flux_params params = FLUX_PARAMS_DEFAULT;
params.width = 256;
params.height = 256;

/* Generate 5 variations with different seeds */
for (int i = 0; i < 5; i++) {
    flux_set_seed(1000 + i);
    flux_image *img = flux_generate(ctx, "A mountain landscape at sunset", &params);
    char filename[64];
    snprintf(filename, sizeof(filename), "landscape_%d.png", i);
    flux_image_save(img, filename);
    flux_image_free(img);
}
flux_free(ctx);

Note: The text encoder (~8GB) is automatically released after the first generation to save memory. It reloads automatically if you use a different prompt.

All functions that can fail return NULL on error. Use flux_get_error() to get a description:

flux_ctx *ctx = flux_load_dir("nonexistent-model");
if (!ctx) {
    fprintf(stderr, "Error: %s\n", flux_get_error());
    /* Prints something like: "Failed to load VAE - cannot generate images" */
    return 1;
}

Core functions:

flux_ctx *flux_load_dir(const char *model_dir); /* Load model, returns NULL on error */
void flux_free(flux_ctx *ctx); /* Free all resources */

flux_image *flux_generate(flux_ctx *ctx, const char *prompt, const flux_params *params);
flux_image *flux_img2img(flux_ctx *ctx, const char *prompt, const flux_image *input, const flux_params *params);

Image handling:

flux_image *flux_image_load(const char *path); /* Load PNG or PPM */
int flux_image_save(const flux_image *img, const char *path); /* 0=success, -1=error */
flux_image *flux_image_resize(const flux_image *img, int new_w, int new_h);
void flux_image_free(flux_image *img);
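
A small usage sketch of these helpers (file names are just examples):

flux_image *img = flux_image_load("input.png");       /* PNG or PPM */
if (img) {
    flux_image *small = flux_image_resize(img, 256, 256);
    if (small) {
        flux_image_save(small, "input_256.png");      /* returns 0 on success */
        flux_image_free(small);
    }
    flux_image_free(img);
}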

Utilities:

void flux_set_seed(int64_t seed); /* Set RNG seed for reproducibility */
const char *flux_get_error(void); /* Get last error message */
void flux_release_text_encoder(flux_ctx *ctx); /* Manually free ~8GB (optional) */
Generation parameters:

typedef struct {
    int width;            /* Output width in pixels (default: 256) */
    int height;           /* Output height in pixels (default: 256) */
    int num_steps;        /* Denoising steps, use 4 for klein (default: 4) */
    float guidance_scale; /* CFG scale, use 1.0 for klein (default: 1.0) */
    int64_t seed;         /* Random seed, -1 for random (default: -1) */
    float strength;       /* img2img only: 0.0-1.0 (default: 0.75) */
} flux_params;

/* Initialize with sensible defaults */
#define FLUX_PARAMS_DEFAULT { 256, 256, 4, 1.0f, -1, 0.75f }

License: MIT



Comments

  • By antirez 2026-01-18 19:25 (10 replies)

    Something that may be interesting for the reader of this thread: this project was possible only once I started to tell Opus that it needed to keep a file with all the implementation notes, and also to accumulate all the things we discovered during the development process. The file also had clear instructions to be kept updated, and to be processed ASAP after context compaction. This kinda enabled Opus to do such a big coding task in a reasonable amount of time without losing track. Check the file IMPLEMENTATION_NOTES.md in the GitHub repo for more info.

    • By lukebechtel 2026-01-18 19:30 (4 replies)

      Very cool!

      Yep, a constantly updated spec is the key. Wrote about this here:

      https://lukebechtel.com/blog/vibe-speccing

      I've also found it's helpful to have it keep an "experiment log" at the bottom of the original spec, or in another document, which it must update whenever things take "a surprising turn"

      • By ctoth 2026-01-18 20:54 (1 reply)

        Honest question: what do you do when your spec has grown to over a megabyte?

        Some things I've been doing:

        - Move as much actual data into YML as possible.

        - Use CEL?

        - Ask Claude to rewrite pseudocode in specs into RFC-style constrained language?

        How do you sync your spec and code both directions? I have some slash commands that do this but I'm not thrilled with them?

        I tend to have to use Gemini for actually juggling the whole spec. Of course it's nice and chunked as much as it can be? but still. There's gonna need to be a whole new way of doing this.

        If programming languages can have spooky language at a distance wait until we get into "but paragraph 7, subsection 5 of section G clearly defines asshole as..."

        What does a structured language look like when it doesn't need mechanical sympathy? YML + CEL is really powerful and underexplored but it's still just ... not what I'm actually wanting.

        • By lukebechtel 2026-01-18 20:59 (1 reply)

          Sharding or compaction, both possible with LLMs.

          Sharding: Make well-named sub-documents for parts of work. LLM will be happy to create these and maintain cross references for you.

          Compaction: Ask the LLM to compact parts of the spec, or changelog, which are over specified or redundant.

          • By ctoth 2026-01-18 21:04 (1 reply)

            My question was something like: what is the right representation for program semantics when the consumer is an LLM and the artifact exceeds context limits?

            "Make sub-documents with cross-references" is just... recreating the problem of programming languages but worse. Now we have implicit dependencies between prose documents with no tooling to track them, no way to know if a change in document A invalidates assumptions in document B, no refactoring support, no tests for the spec.

            To make things specific:

            https://github.com/ctoth/polyarray-spec

            • By lukebechtel 2026-01-18 21:31

              Ah, I see your point more clearly now.

              At some level you have to do semantic compression... To your point on non-explicitness -- the dependencies between the specs and sub-specs can be explicit (i.e. file:// links, etc).

              But your overall point on assumption invalidation remains... Reminds me of a startup some time ago that was doing "Automated UX Testing" where user personas (i.e. prosumer, avg joe, etc) were created, and Goals/ Implicit UX flows through the UI were described (i.e. "I want to see my dashboard", etc). Then, an LLM could pretend to be each persona, and test each day whether that user type could achieve the goals behind their user flow.

              This doesn't fully solve your problem, but it hints at a solution perhaps.

              Some of what you're looking for is found by adding strict linter / tests. But your repo looks like something in an entirely different paradigm and I'm curious to dig into it more.

      • By anonzzzies 2026-01-19 1:38 (1 reply)

        We found, especially with Opus and recent Claude Code, that it is better/more precise at reading existing code to figure out the current status than at reading specs. It seems (for us) it is less precise at 'comprehending' the spec English than the code, and that sometimes shows up as wrong assumptions for new tasks, which results in incorrect implementations of those tasks. So we dropped this. Because of caching, it doesn't seem too bad on tokens either.

        • By nonethewiser 2026-01-19 4:21

          Specs with agents seem destined for drift. It'll randomly change something you don't know about, and it will go too fast for you to really keep it updated. I went from using Claude Code totally naively, to using little project management frameworks, to now just using it by itself again. I'm getting the best results like this, and usually start in planning mode (unless the issue is quite small/clear).

          My experience has been that it gets worse with more structure. You misinform it and heavily bias its results in ways you don't intend. Maybe there are AI wizards out there with the perfect system of markdown artifacts, but I found it increased the trouble a lot and made the results worse. It's a non-deterministic system. Knock yourself out trying to micromanage it.

      • By celadin 2026-01-19 0:13

        I'm still sharing this post in the internal org trainings I run for those new to LLMs. Thanks for it - really great overview of the concept!

        I saw in your other comment you've made accommodations for the newer generation, and I will confess that in Cursor (with plan mode) I've found an abbreviated form works just as well as the extremely explicit example found in the post.

        If you ever had a followup, I imagine it'd be just as well received!

      • By daliusd 2026-01-18 21:24 (1 reply)

        Looks like default OpenCode / Claude Code behavior with Claude models. Why the extra prompt?

        • By lukebechtel 2026-01-18 21:35

          Good question!

          1. The post was written before this was common :)

          2. If using Cursor (as I usually am), this isn't what it always does by default, though you can invoke something like it using "plan" mode. Its default is to keep todo items in a nice little todo list, but that isn't the same thing as a spec.

          3. I've found that Claude Code doesn't always do this, for reasons unknown to me.

          4. The prompt is completely fungible! It's really just an example of the idea.

    • By vessenes 2026-01-18 19:46 (1 reply)

      Salvatore - this is cool. I am a fan of using Steve Yegge's beads for this - it generally cuts the markdown file cruft significantly.

      Did you run any benchmarking? I'm curious if Python's stack is faster or slower than a pure C vibe-coded inference tool.

      • By samtheprogram 2026-01-19 0:47

        There are benchmarks in the README. Python is ~10x faster. It’s heavily optimized. Based on the numbers and my experience with Flux.1, I’m guessing the Python run is JIT’d (or Flux.2 is faster), although it’d likely only be ~half as fast if it weren’t (i.e. definitely not 10x slower).

    • By bloudermilk 2026-01-18 22:14

      Do you plan on writing about the other lessons you learned, which you mentioned in the README? As a big fan of your software and writing for many years, I would deeply appreciate your perspective using these tools!

    • By AINoob2026 2026-01-18 23:06

      This is amazing. Is there any way you could share the log of prompts you used and other things aside from the implementation notes to reach such a result? Would love to learn from your experience and steps. Thank you

    • By terhechte 2026-01-18 20:59

      There are multiple task-management solutions for Claude or other LLMs that let it define tasks, add implementation notes, and (crucially) add sub-tasks and dependencies. I'm using Beads (https://github.com/steveyegge/beads) and I think it really improves the outcome, especially for larger projects.

    • By thundergolfer 2026-01-18 20:40 (1 reply)

      Was the LLM using vision capabilities to verify the correctness of its work? If so, how was that verification method guided by you?

      • By antirez 2026-01-18 20:43

        Yes, Opus could check the image to see if it matched the prompt, but I advised the model to stop and ask the human for a better check and a description of what the cause of the corrupted image could be. But the fact that it could catch obvious regressions was good.

    • By echelon 2026-01-18 23:40

      > No Python runtime, no PyTorch, no CUDA toolkit required at inference time.

      This is amazing, Salvatore! Please spend some more time here and free us from the CUDA toolkit and Python.

    • By soulofmischief 2026-01-18 20:00 (1 reply)

      It's funny watching people rediscover well-established paradigms. Suddenly everyone's recreating software design documents [0].

      People can say what they want about LLMs reducing intelligence/ability; the trend has clearly been that people are beginning to get more organized, document things better, enforce constraints, and think in higher-level patterns. And there's renewed interest in formal verification.

      LLMs will force the skilled, employable engineer to chase both maintainability and productivity from the start, in order to maintain a competitive edge with these tools. At least until robots replace us completely.

      [0] https://www.atlassian.com/work-management/knowledge-sharing/...

      • By falloutx 2026-01-18 23:52 (1 reply)

        The thing is that currently most of these projects are just done by engineers. It's easy to stay organized when the project lasts a couple of weeks and stays within <5 engineers. The issues start when the software starts living longer and you add in the modern agile practices; it becomes a complete mess, with each PM trying to add random features on top of the existing code. As you add more and more code, maintainability will just become impossible.

        • By soulofmischief 2026-01-19 4:03

          I am aware that software complexity scales. That is literally why I suggested that having good standards from the start is becoming increasingly important.

    • By dostick 2026-01-18 21:17

      So Codex would do that task with regular spec and no recompacting?

    • By tucnak 2026-01-18 20:11

      This development work-cycle pattern lends itself nicely to Antigravity, which does about 80% of this out of the box, and can be nudged to do the rest with a little bit of prompting.

  • By neomantra 2026-01-18 19:28 (1 reply)

    Thanks for sharing this — I appreciate your motivation in the README.

    One suggestion, which I have been trying to do myself, is to include a PROMPTS.md file. Since your purpose is sharing and educating, it helps others see what approaches an experienced developer is using, even if you are just figuring it out.

    One can use a Claude hook to maintain this deterministically. I instruct in AGENTS.md that they can read but not write it. It’s also been helpful for jumping between LLMs, to give them some background on what you’ve been doing.

    • By antirez 2026-01-18 19:55 (5 replies)

      In this case, instead of a prompt I wrote a specification, but later I had to steer the models for hours. So basically the prompt is the sum of all such interactions: incredibly hard to reconstruct into something meaningful.

      • By chr15m 2026-01-19 6:08

        aider keeps a log of this, which is incredibly useful.

      • By enriquto 2026-01-18 19:59 (1 reply)

        This steering is the main "source code" of the program that you wrote, isn't it? Why throw it away? It's like deleting the .c once you have obtained the .exe.

        • By minimaxir 2026-01-18 20:34

          It's more noise than signal because it's disorganized, and hard to glean value from it (speaking from experience).

      • By wyldfire 2026-01-18 20:43

        I've only just started using it but the ralph wiggum / ralph loop plugin seems like it could be useful here.

        If the spec and/or tests are sufficiently detailed maybe you can step back and let it churn until it satisfies the spec.

      • By neomantra 2026-01-18 20:28

        Isn't the "steering" in the form of prompts? You note "Even if the code was generated using AI, my help in steering towards the right design, implementation choices, and correctness has been vital during the development." You are a master of this, let others see how you cook, not just taste the sauce!

        I only say this as it seems one of your motivations is education. I'm also noting it for others to consider. Much appreciation either way, thanks for sharing what you did.

      • By stellalo 2026-01-18 20:09 (1 reply)

        Doesn’t Claude Code allow you to just dump entire conversations, with everything that happened in them?

        • By joemazerino 2026-01-18 20:25 (1 reply)

          All sessions are located in the `~/.claude/projects/foldername` subdirectory.

          • By ukuina 2026-01-18 20:46 (2 replies)

            Doesn't it lose prompts prior to the latest compaction?

            • By jitl 2026-01-19 1:22

              I’ve sent Claude back to look at the transcript file from before compaction. It was pretty bad at it but did eventually recover the prompt and solution from the jsonl file.

            • By onedognight 2026-01-18 22:49

              It loses them in the current context (say 200k tokens), not in its SQLite history db (limited by your local storage).

  • By kristianp 2026-01-19 2:16 (1 reply)

    Note that the original FLUX.2 [klein] model [1] and Python code were only released about 3 days ago (inexact without knowing the times and time zones involved). Discussed at [2].

    [1] https://bfl.ai/blog/flux2-klein-towards-interactive-visual-i...

    [2] https://news.ycombinator.com/item?id=46653721

    • By p1esk 2026-01-19 2:19

      I wonder how long it would have taken antirez without opus
