Open Weights isn't Open Training

2026-03-09 23:37 · www.workshoplabs.ai

How many monkey-patches does it take to post-train a trillion parameter model?

When I was in college, my data structures professor told a story. It went something like this:

"When I was your age, I received an assignment, and encountered an inexplicable bug. I debugged and debugged and found that adding a print statement resolved the bug. I was young like all of you, and I was certain I'd found a bug in the C compiler. Turns out the problem was me."

The takeaway was clear: if you have a bug, it's your fault.

This is a good heuristic for most cases, but with open source ML infrastructure, you need to throw this advice out the window. There might be features that appear to be supported but are not. If you're suspicious about an operation or stage that's taking a long time, it may be implemented in a way that's efficient enough…for an 8B model, not a 1T+ one. HuggingFace is good, but it's not always correct. Libraries have dependencies, and problems can hide several layers down the stack. Even PyTorch isn't ground truth.

Over the past couple months, I worked on developing infrastructure to post-train and serve models cheaply. Ultimately, my team decided to develop a custom training codebase, but only after I spent a few days attempting to use existing open-source options. The following is an account of my successes and failures and what it means for open-weights models.

Making it work

The goal is to post-train Kimi-K2-Thinking. My success criteria are both qualitative and quantitative: the loss should go down, and the model's behavior should change in line with the dataset we train on.

It’s an open source model, so surely there should be some training code online. But it turns out there isn’t really any. LLaMA-Factory + KTransformers is supposed to support it, but I encountered a bunch of bugs. Also, it’s designed for CPU offloading + GPU training, which adds unnecessary complexity and is inefficient.

What about HuggingFace? It has basically everything. Kimi-K2-Thinking is available, along with a config and a modeling class that appears to implement the model. The HuggingFace model page doesn't say whether training is supported, but HuggingFace's Transformers library supports models in the same architecture family, such as DeepSeek-V3. The fundamentals seem to be there; we might need some small changes, but how hard can it be?

First, we need a dataset for which we’ll be able to tell if the model has trained. Let's create one that will make our model talk like Yoda. We can get a bunch of questions from TriviaQA, and generate responses by prompting an LLM to answer the question while pretending it’s Yoda. Running the script, I get a few thousand prompts and responses that look something like this:

Question: Who won Super Bowl XX?

Answer: Mmmmm, Super Bowl XX, you ask about? *taps cane thoughtfully*

In 1986, this was. The Chicago Bears, victorious they were! Dominant, their defense was — yes, hmmm. The New England Patriots, they faced. 46-10, the score was.

Legendary, that Bears team became. "The Super Bowl Shuffle," remember it, I do. *chuckles* Strong with the Force... err, strong with the defense, they were! William "Refrigerator" Perry, Walter Payton, Mike Singletary — powerful warriors, all of them.

Crushing victory it was. One of the greatest teams in NFL history, many say. Hmmmm, yes.
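A minimal sketch of the generation step, where `ask_llm` is a hypothetical stand-in for whatever LLM API produces the Yoda impression, and the file path is illustrative:

```python
import json

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call.

    Returns a canned string so the sketch runs end to end; swap in a
    real client to generate actual Yoda-flavored answers."""
    return "Mmmmm, know the answer, I do."

def build_yoda_dataset(questions: list[str], out_path: str) -> None:
    """Write one {question, answer} JSON object per line."""
    with open(out_path, "w") as f:
        for question in questions:
            answer = ask_llm(
                "Answer the following question while pretending "
                f"you are Yoda: {question}"
            )
            f.write(json.dumps({"question": question, "answer": answer}) + "\n")

build_yoda_dataset(["Who won Super Bowl XX?"], "yoda_dataset.jsonl")
```

The JSONL shape (one prompt/response pair per line) is what the training script's dataset class expects to read back.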

Next up, let's load the model onto our GPUs. It's time to understand what we're working with and make hardware decisions. Kimi-K2-Thinking is a state-of-the-art open-weight model: a 1 trillion parameter mixture-of-experts model with multi-head latent attention, whose (non-shared) expert weights are quantized to 4 bits. This means it comes out to 594 GB, with 570 GB of that for the quantized experts and 24 GB for everything else.

I want to load the entire model into GPU memory, so given these specs, 8xH200s seems like the best bet with a combined 1128 GB of GPU memory.

At this point, I want to write a full LoRA training script and see how far it gets. If needed, I’ll debug along the way.

Full training script

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model
from scripts.kimi.make_yoda_dataset import QADataset
import torch
from torch.utils.data import DataLoader
from torch.nn.utils import clip_grad_norm_

model_id = "moonshotai/Kimi-K2-Thinking"
target_modules = [
    "gate_proj", "up_proj", "down_proj", "experts",
    "q_a_proj", "q_b_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj",
]

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

dataset = QADataset(
    "scripts/kimi/yoda_dataset.jsonl", tokenizer, max_length=2048
)
print(f"Loaded yoda dataset with {len(dataset)} examples")

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_cache=False,
    attn_implementation="flash_attention_2",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
print("Loaded successfully!")

model.requires_grad_(False)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=target_modules,
    task_type=TaskType.CAUSAL_LM,
)
lora_model = get_peft_model(model, lora_config)
model.eval()
lora_model.train()
print("Applied lora!")

# Sanity-check forward pass on the base model.
encodings = tokenizer("Hello, how are you?", return_tensors="pt")
encodings = {k: v.to(model.device) for k, v in encodings.items()}
outputs = model(**encodings)
print(
    "Forward pass successful!",
    tokenizer.decode(outputs.logits.argmax(dim=-1)[0], skip_special_tokens=True),
)

trainable_params = [p for p in lora_model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=2e-4)
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

for i, batch in enumerate(dataloader):
    device = next(lora_model.parameters()).device
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = lora_model(**batch)
    if i == 0:
        print("Successful forward pass with lora!")
    loss = outputs.loss
    loss.backward()
    if i == 0:
        print("Successful backward pass!")
    clip_grad_norm_(lora_model.parameters(), max_norm=1.0)
    optimizer.step()
    print(f"Step {i} complete! Loss: {loss.item()}")
    optimizer.zero_grad()
```

Problem 1: Compression is slow

On loading, the script prints ‘Compressing model’, and this ‘compression’ just keeps going for at least an hour. As a sanity check, let's swap in Llama 8B.

The good news: Llama 8B skips the compression step and trains perfectly. The bad news: we'll have to venture into the transformers codebase to find this Kimi-specific issue.

So, where is ‘Compressing model’ coming from? I can search for it in the transformers package with grep -r "Compressing model" ., but nothing comes up. Searching within all installed packages, there are four hits, all in vLLM's compressed_tensors package. After some investigation to narrow it down, it's likely coming from ModelCompressor.compress_model, which transformers calls in CompressedTensorsHfQuantizer._process_model_before_weight_loading.

compress_model appears to quantize the model by iterating through every module and quantizing them one by one. Maybe we could parallelize it. But our model is natively quantized: the weights are already stored in the quantized format, so we shouldn't need to quantize them again. Yet compress_model is called whenever the config indicates the model is quantized, with no check for whether the weights are already compressed. Let's try deleting the call to compress_model and see if the problem goes away without anything else breaking.
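The deletion can be done without forking transformers by monkey-patching the method to a no-op before loading. A local stand-in class is used here so the sketch is self-contained and runnable; the real target would be compressed_tensors' ModelCompressor.compress_model:

```python
class ModelCompressor:
    """Stand-in for compressed_tensors' ModelCompressor, whose
    compress_model walks every module and re-quantizes it one by one."""

    def compress_model(self, model):
        raise RuntimeError("slow per-module compression we want to skip")

def skip_compress(self, model):
    # No-op replacement: the checkpoint's expert weights are already in
    # quantized form, so re-compressing them does no useful work.
    return model

# The monkey-patch: swap the method out before the model is loaded.
ModelCompressor.compress_model = skip_compress

compressor = ModelCompressor()
assert compressor.compress_model("already-quantized") == "already-quantized"
```

The same pattern (import the class, replace the offending method, then load the model) covers most of the patches in this post.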

The fix successfully skipped the initial compression!

Problem 2: ??? (and a brief history of GPU memory management)

We are now getting past the compression step. Except now it’s been 10 minutes and it still hasn’t printed ‘loaded successfully’…

After 20 minutes it loads, but it seems strange to take this long. I put in some prints to narrow down what's taking the time. It's getting stuck in accelerate's dispatch_model function, which is supposed to distribute the loaded model across GPUs. But even once the memory is already on the GPUs, it still takes forever. Nothing in the code looks suspicious, and it doesn't seem like anything intensive happens after ‘Loading checkpoint shards’ completes.

I wanted to write this article like a story. I wanted the reader to be able to make sense of what’s happening at each point. But the solution here really just doesn’t make sense at all. I do not recall and cannot imagine how I discovered the solution.

It turns out you have to set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.

By default, freeing memory in CUDA is expensive because it forces a GPU sync. Because of this, PyTorch avoids freeing and mallocing memory through CUDA and tries to manage it itself. When blocks are freed, the allocator just keeps them in its own cache, and it can reuse those free blocks when something else is allocated. But if the cached blocks are fragmented, there isn't a large enough block available, and all GPU memory is already allocated, PyTorch has to free all of the allocator's cached blocks and then allocate from CUDA, which is a slow process. This is what our program is getting blocked by. The situation might look familiar if you've taken an operating systems class.

Back in the day, computers had to figure out how to divide physical memory between different processes safely. The solution: each program gets its own virtual memory address space and contiguous virtual memory doesn’t have to be contiguous physical memory. Physical memory is chunked into fixed-size pages and allocated on demand. This solution has a nice bonus property: you can allocate contiguous blocks when free memory is fragmented. Virtual memory stuck around.

That is, until GPUs. In 1999, the first modern GPU was released, and seemingly someone went, “hmm, virtual memory?” No: this is a Graphics Processing Unit, it will only process graphics. No one using this will have multiple processes or complex memory management.

Fast forward a couple years, and people went, “oh, we can use GPUs for a lot more than processing graphics.” CUDA was released in 2007 to make it easier, but the GPUs didn’t have the hardware for virtual memory, so CUDA didn’t support it either.

In 2010, GPUs first gained support for virtual memory, but despite decades of development around virtual memory on CPUs, CUDA's virtual memory had two major limitations. First, it didn't support memory overcommitment: when you allocate virtual memory with CUDA, it immediately backs it with physical pages, whereas typically you get a large virtual address space and physical memory is only mapped to virtual addresses when they're first accessed. Second, to be safe, freeing and mallocing forced a GPU sync, which slowed them down a ton. This pushed applications like PyTorch to essentially manage memory themselves instead of relying entirely on CUDA.

For PyTorch, this means:

  1. We can't expand segments that have already been allocated; instead we have to free them and reserve new ones.
  2. The same fragmentation problems we get in physical memory show up in virtual memory, and the only fix (freeing everything) takes a long time. Point 1 exacerbates this because it forces more mallocs and frees.

Ironically, PyTorch could build its own layer of virtual memory to solve this, but it would likely add overhead that exceeds the benefits.

In any case, in 2019 CUDA added a more comprehensive virtual memory API that allowed overcommitment and didn't force syncing, among other things. In 2023, PyTorch made use of it with expandable segments, which map more physical memory onto existing segments as needed and use the non-syncing alloc/free operations. We can enable this with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but it's not on by default.
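One low-tech way to make sure the setting is always active: export it in the shell that launches the script, or set it at the very top of the script, since the allocator reads it when CUDA is first initialized:

```python
import os

# The allocator reads PYTORCH_CUDA_ALLOC_CONF when CUDA is first
# initialized, so this must run before `import torch` / the first
# CUDA allocation (or be exported in the launching shell).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```

Setting it after torch has already touched the GPU has no effect, which makes this an easy configuration to silently get wrong.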

On the bright side, it now finishes loading immediately after the checkpoint shards are loaded! But now…

Problem 3: Weight initialization

The script throws an out-of-memory error on the non-LoRA model forward pass. Printing GPU memory immediately after loading the model, each GPU has 62.7 GB allocated, except GPU 7, which has 120.9 GB (out of 141). Ideally, the weights should be distributed evenly, and we can specify which weights go where with device_map. You might wonder why device_map='auto' distributes weights so unevenly. I certainly did, but I could not find a satisfactory answer, and I'm convinced it would be trivial to distribute the weights relatively evenly.

Anyway, let's specify a device map ourselves, with the first n=ceil(num_layers / num_gpus) layers on GPU 0, the next n on GPU 1, etc.
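A sketch of that device map, assuming the checkpoint follows the usual `model.layers.N` naming in HuggingFace checkpoints (the 61-layer count below is illustrative, not taken from the model config):

```python
import math

def make_device_map(num_layers: int, num_gpus: int) -> dict:
    """Place the first ceil(num_layers / num_gpus) layers on GPU 0, the
    next chunk on GPU 1, and so on; embeddings go with the first chunk,
    the final norm and lm_head with the last."""
    per_gpu = math.ceil(num_layers / num_gpus)
    device_map = {
        "model.embed_tokens": 0,
        "model.norm": num_gpus - 1,
        "lm_head": num_gpus - 1,
    }
    for layer in range(num_layers):
        device_map[f"model.layers.{layer}"] = layer // per_gpu
    return device_map

# e.g. 61 layers across 8 GPUs: 8 layers per GPU, the last GPU gets 5
device_map = make_device_map(num_layers=61, num_gpus=8)
```

The resulting dict can be passed as `device_map=` to `from_pretrained` in place of 'auto'.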

Problem 4: Quantized weights don't work with LoRAs

Looks like the quantized weights don't have the attributes that get_peft_model looks for when applying LoRAs. There's probably a way to fix this, but we can move past it for now by simply not applying LoRAs to the quantized experts. We can still apply them to the shared experts, as they're not quantized.
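Concretely, that means trimming the quantized expert projections out of the target_modules list from the training script (treating "experts" as the module name covering the routed, INT4-quantized experts is my reading of the checkpoint layout):

```python
target_modules = [
    "gate_proj", "up_proj", "down_proj", "experts",
    "q_a_proj", "q_b_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj",
]

# Drop the routed experts (the INT4-quantized weights PEFT can't wrap);
# attention projections and the bf16 shared experts keep their LoRAs.
target_modules = [name for name in target_modules if name != "experts"]
```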

Problem 5: Assert not training

Looking at the forward pass implementation of MoEGate we find:
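Paraphrased as a runnable stand-in (the real class ships with the model's remote modeling code, so names and shapes here are illustrative), the gate's top-k path refuses to run in train mode:

```python
import torch
import torch.nn as nn

class MoEGate(nn.Module):
    """Minimal stand-in for the gate's guard, not the real router."""

    def __init__(self, topk_method: str = "noaux_tc", top_k: int = 2):
        super().__init__()
        self.topk_method = topk_method
        self.top_k = top_k

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        if self.topk_method == "noaux_tc":
            # The aux-loss-free top-k selection isn't differentiable, so
            # the implementation fails loudly rather than silently never
            # updating the router weights.
            assert not self.training, "noaux_tc does not support training"
        return scores.topk(self.top_k, dim=-1).indices

gate = MoEGate()
gate.eval()                        # fine: selects experts, no gradients
experts = gate(torch.randn(4, 8))
# gate.train() would make the next forward pass trip the assertion
```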

"noaux_tc" is the only topk_method available, so why can't we put the model in train mode? Well, this implementation of MoEGate isn't differentiable. I guess whoever implemented it decided it should fail on the forward pass rather than silently fail by never updating the router weights. That said, requires_grad for the gate was False and I intentionally did not attach LoRAs to it, so the routers wouldn't have trained anyway. The routers are likely already fine without additional training, and training them might be unstable or throw off expert load balancing.

We could just delete this assertion. Or we could set the model to eval mode: contrary to the name, eval mode has nothing to do with whether the model is trainable. It just turns off train-time behavior. Historically, this meant no dropout and using stored batch-norm statistics rather than per-batch statistics. With modern LLMs it means, well, nothing; there typically are no train-time-specific behaviors. Whether parameters actually update is controlled by requires_grad and by which parameters are passed to the optimizer.
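These two knobs are independent, which a toy module makes concrete:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))

model.eval()  # disables train-time behavior (here, dropout)...
# ...but has no effect on gradient tracking:
assert all(p.requires_grad for p in model.parameters())

model(torch.randn(2, 4)).sum().backward()
assert model[0].weight.grad is not None  # gradients flowed in eval mode

# Freezing parameters is a separate switch entirely:
model.requires_grad_(False)
assert not any(p.requires_grad for p in model.parameters())
```

So calling model.eval() while keeping the LoRA parameters in the optimizer trains exactly the same parameters as before; it only skips train-time behaviors the model doesn't have anyway.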

That resolved the assertion error, but now…

Problem 6: Out of memory

We run out of memory on the first forward pass of the training loop, even when I decrease the batch size to 1 and the sequence length to 256. We already did a forward pass without the LoRA on just a couple of tokens, so this is strange.

The stack trace shows that it runs out of memory during dequantization within an MoE infer call. Some quick estimation suggests that a sequence this short has no business using up 526 GB of free space; it's definitely a bug, not a fundamental limitation.

Logging the memory, the forward pass starts, memory starts increasing on GPU 0, and then it OOMs. I wonder if it's trying to be smart by planning ahead and dequantizing multiple layers at a time. Dequantizing each layer uses ~36 GB of memory, so if it were doing this, that could explain the excessive usage. Maybe putting each layer on alternating GPUs would help.

Alternating which GPU each layer is on didn't fix it, but it did produce an interesting result: it took longer to OOM. The memory started increasing on GPU 0, then 1, then 2, …, until eventually it came back around and OOMed. This means memory is accumulating as the forward pass goes on; with each layer, more memory is allocated and not freed. That could happen if we're saving activations or gradients. Let's try wrapping with torch.no_grad and setting requires_grad=False even for the LoRA.

Still not right. Luckily, I guess; it would be bad news if activations or gradients took up that much space. The INT4 quantized weights are a bit non-standard, so here's a hypothesis: maybe for each layer the weights are dequantized and the computation done, but the dequantized weights are never freed. Since the dequantization is also where the OOM occurs, the logic that initiates dequantization is right there in the stack trace.

It seems like register_offload_parameter is trying to offload the parameter to CPU or some other non-GPU device, but maybe isn't actually working. Maybe the offloading framework isn't set up properly, a condition isn't met, or the dict it offloads to is actually still in GPU memory. Either way, let's try the simple thing: don't create the parameter at all, and explicitly delete weight_data.

Results and takeaways

This got it to train! We can increase to a batch size of 8 with a sequence length of 2048, at 45 seconds per step, or about 364 training tokens per second, though it still fails to train the experts. For reference, this is fast enough to be usable and get through our dataset, but it ends up being ~6-9x more expensive per token than using Tinker.

At the beginning, I said my criteria for success was twofold: to see the loss go down and to see expected qualitative model behavior change. And here it is!

Training cross-entropy loss graph

Prompt: who are you?

Response:

<think> </think> Hmm, created by Moonshot AI, I am. A language model, yes. Kimi, my name is. Hrrr.

Help you, I can. Answer questions, write, analyze information — many things, I do. Knowledge from vast training, I possess. But conscious, I am not. A tool, I …

Prompt: Can you give some advice?

Response:

<think> </think> Mmmmm, yes. Advice, you seek.

Listen more than you speak, you should. In silence, much wisdom there is.

Patient, be you. Like the river carving stone, great things take time.

Humble, remain. The more you learn, you realize, the less you know, you see.

Kind, be always. Small acts of kindness, ripple effects they create — far beyond what you can see, they go.

Present, live in the now. The past, gone it is. The future, not yet here it is. …

I cut it off after a couple sentences as the generation was quite slow, but it sounds like Yoda to me!

Through this process, I realized the original goals were insufficient. Experts have been shown to be the most important parameters to train and I want our training stack to train them. I also want it to go faster. Ideally faster than any other training stack.

Usually solving difficult programming problems feels like a win. When I finally saw the training loop running and the loss going down, it too felt like a win – like I finally beat the codebase that had been trying its hardest to make me fail.

And, even so, the experts don’t train. All this time was just to get a result nearly an order of magnitude more expensive than a training API. It’s still a pain to modify, optimize, or profile the HuggingFace code and we’re using essentially the slowest distributed training method possible. Better parallelization setups/configurations are supposed to be compatible with HuggingFace, but our efforts to set these up were fruitless. Can we really call it a win?

Open source models offer a compelling proposition of distributing the value created by AI more broadly, creating more winners, and enabling more people to build. After the last two months, I’m less convinced it’s that easy. As I worked with the open source model ecosystem, every fix revealed a new bug, each covered up by many layers of abstraction. There’s debt hidden in every layer of the stack, and with open source ML infra, the stack is deep.

My professor was right that usually bugs are your fault. But with open source ML infrastructure, sometimes the library, or the library’s library, or the allocator really is the problem.

And when the stack is repeatedly the problem, that’s when you need to stop patching and start building.

Thanks to Cody Rushing, Rudolf Laine, Luke Drago, Galen Mead, and Tim Kostolansky for reviewing drafts of this post.



Comments

  • By oscarmoxon 2026-03-09 23:48

    The framing here is undersold in the broader discourse: "open weights" is a ruse for reproducibility. What you have is closer to a compiled binary than source code. You can run it, you can diff it against other binaries, but you cannot, in any meaningful sense, reproduce or extend it from first principles.

    This matters because OSS truly depends on the reproducibility claim. "Open weights" borrows the legitimacy of open source (the assumption that scrutiny is possible, that no single actor has a moat, that iteration is democratised). Truly democratised iteration would crack open the training stack and let you generate intelligence from scratch.

    Huge kudos to Addie and the team for this :)

    • By Wowfunhappy 2026-03-10 18:06

      But how useful is source code if it takes millions of dollars to compile? At that point, if you do need to make changes, it probably makes more sense to edit the precompiled binary. Even the original developers are doing binary edits in most cases.

      I agree that open weight models should not be considered open source, but I also think the entire definition breaks down under the economics of LLMs.

      • By scottlamb 2026-03-10 18:16

        There are lots of reasons to read through source code you never edit or recompile: security audits, interoperability, learning from their techniques, etc. And I think many of those same ideas apply to seeing the training data of a LLM. It will help you understand quickly (without as much experimentation) what it's likely to be good at, where its biases may be, where some kind of supplement (transfer learning? RAG? whatever) might be needed. And the why.

        • By vova_hn2 2026-03-11 00:46

          > security audits

          If you are unable to run the multimillion training, then any kind of security audit of the training code is absolutely meaningless, because you have no way to verify that the weights were actually produced by this code.

          Also, the analogy with source code/binary code fails really fast, considering that the model training process is non-deterministic. Even if you are able to run the training, you get different weights than those that were released by the model developers, and then... then what?

          • By scottlamb 2026-03-11 15:41

            I probably shouldn't have led with that example because yeah, reproducible (and cheap) builds would be best for security audits. But I wouldn't say it's absolutely meaningless. At least it can guide your experimentation, and if results start differing radically from what you'd expect from the training data, that raises interesting questions.

          • By Dylan16807 2026-03-11 12:41

            If you're going through the effort to be open source you can probably set up fixed batch sizes and deterministic combination of batches without too much more effort. At least I hope it's not super hard.

          • By HappMacDonald 2026-03-11 07:19

            > considering that model training process is non-deterministic

            Why would it have to be? Just use PRNG with published seeds and then anyone can reproduce it.

            • By dataflow 2026-03-11 08:41

              I have zero actual experience in training models, but in general, when parallelizing work: there can be fundamental nondeterminism (e.g., some race conditions) that is tolerated, whose recording/reproduction can be prohibitive performance-wise.

        • By oscarmoxon 2026-03-10 18:37

          Agree, this feels like a distinction that needs formalising...

          Passive transparency: training data, technical report that tells you what the model learned and why it behaves the way it does. Useful for auditing, AI safety, interoperability.

          Active transparency: being able to actually reproduce and augment the model. For that you need the training stack, curriculum, loss weighting decisions, hyperparameter search logs, synthetic data pipeline, RLHF/RLAIF methodology, reward model architecture, what behaviours were targeted and how success was measured, unpublished evals, known failure modes. The list goes on!

          • By addiefoote8 2026-03-10 18:50

            I'd also add training checkpoints to the list for active transparency. I think the Olmo models do a decent job, but it would be cool to see it for bigger models and for ones that are closer to state-of-the-art in terms of both architecture and algorithms.

        • By kazinator 2026-03-10 22:17

          Security audits, etc, are possible because binary code closely implements what the source code says.

          In this case, you have no idea what the weights are going to "do", from looking at the source materials --- the training data and algorithm --- without running the training on the data.

      • By oscarmoxon 2026-03-10 18:31

        Compute costs are falling fast, training is getting cheaper. GPT-2 costs pocket change to train, and now it costs pocket change to tune >1T parameter models. If it was transparent what costs went into the weights, they could be commodified and stripped of bloat. Instead the hidden cost is building the infrastructure that was never tested at scale by anyone other than the original developers who shipped no documentation of where it fails. Unlike compute, this hidden cost doesn't commodify on its own.

      • By addiefoote8 2026-03-10 18:45

        yeah, the costs are definitely a factor and prohibitive in completely replicating an open source model. Still, there's a lot of useful things that can be done cheaply, including fine tuning, interpretability work, and other deeper investigations into the model that can't happen without the infrastructure.

    • By maxwg 2026-03-10 23:12

      The training methods are largely published in their open research papers - though arguably some open weight companies are less open with the exact details.

      Realistically a model will never be "compiled" 1:1. Copyrighted data is almost certainly used and even _if_ one could somehow download the petabytes of training data - it's quite likely the model would come out differently.

      The article seems to be talking more about the difficulties of fine tuning models though - a setup problem that likely exists in all research, and many larger OSS projects that get more complicated.

      • By alansaber 2026-03-10 23:49

        Yes, the issue is they can embellish the shit out of the papers b/c we only see the final result

    • By anon373839 2026-03-11 05:54

      > "Open weights" borrows the legitimacy of open source

      I don't really see how open-weights models need to borrow any legitimacy. They are valuable artifacts being given away that can be used, tested and repurposed forever. Fully open models like the OLMo series and Nvidia's Nemotron are much more valuable in some contexts, but they haven't quite cracked the level of performance that the best open-weights models are hitting. And I think that's why most startups are reaching for Chinese base LLMs when they want to tune custom models: the performance is better and they were never going to bother with pretraining anyway.

  • By mnkv 2026-03-10 20:30

    This blog post describes the basic work of a research engineer and nothing more. The amount of surprise the author has seems to suggest they haven't really worked in ML for very long.

    Honestly? This is the best it's ever been. Getting stuff to run before huggingface and uv and docker containers with cuda was way worse. Even with full open source, go try to run a 3+ year old model and codebase. The field just moves very fast.

  • By 2001zhaozhao 2026-03-11 06:27

    Open-weight AI is actually analogous to closed source, free shareware you can decompile and modify yourself and run on your computer or a cloud server of your choice.

    It's a clear distinction to proprietary AI, which is analogous to SaaS software controlled by a company that runs it on its own cloud, and owns your data.

    But it's still not open source.
