I have reimplemented Stable Diffusion 3.5 from scratch in pure PyTorch

2025-06-14 13:56 · github.com

A reimplementation of Stable Diffusion 3.5 in pure PyTorch - yousef-rafat/miniDiffusion

SD3 Diagram

miniDiffusion is a reimplementation of the Stable Diffusion 3.5 model in pure PyTorch with minimal dependencies. It's designed for educational, experimental, and hacking purposes. It's made with the mindset of having the least amount of code necessary to recreate Stable Diffusion 3.5 from scratch, with only ~2,800 lines of code spanning the VAE, the DiT, and the training and dataset scripts.

-Files: The main Stable Diffusion model code lives in dit.py, dit_components.py, and attention.py. dit.py contains the main model; dit_components.py contains the embedding, normalization, patch embedding, and helper functions for the DiT code; and attention.py contains the Joint Attention implementation. noise.py holds the Euler scheduler that solves the rectified-flow ODE.
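
To make the scheduler's role concrete, here is a minimal sketch of a flow-matching Euler sampling loop of the kind noise.py implements (the model call signature and the linear timestep schedule are illustrative assumptions, not the repo's exact API):

import torch

@torch.no_grad()
def euler_sample(model, latent_shape, num_steps=28, device="cuda"):
    # Rectified flow uses the straight path x_t = (1 - t) * x_0 + t * noise,
    # so the model's velocity prediction (roughly noise - x_0) can be integrated
    # with plain Euler steps from pure noise (t = 1) back to data (t = 0).
    x = torch.randn(latent_shape, device=device)
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t, t_next = timesteps[i], timesteps[i + 1]
        v = model(x, t.expand(latent_shape[0]))  # predicted velocity at time t
        x = x + (t_next - t) * v                 # Euler step; dt is negative
    return x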

The text encoders are in t5_encoder.py and clip.py, and their tokenizers are both in tokenizer.py. metrics.py implements the Fréchet Inception Distance (FID).
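
For reference, FID compares the Gaussian statistics of Inception features from real and generated images; a compact sketch of that formula (not necessarily the exact code in metrics.py):

import torch

def fid_from_features(feats_real, feats_fake):
    # feats_*: (N, D) Inception-v3 activations for real and generated images.
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    sigma1, sigma2 = torch.cov(feats_real.T), torch.cov(feats_fake.T)
    # Tr((sigma1 @ sigma2)^(1/2)) equals the sum of square roots of the
    # eigenvalues of sigma1 @ sigma2, which avoids an explicit matrix sqrt.
    eigvals = torch.linalg.eigvals(sigma1 @ sigma2).real.clamp(min=0)
    diff = mu1 - mu2
    return (diff @ diff + torch.trace(sigma1) + torch.trace(sigma2)
            - 2 * eigvals.sqrt().sum()).item()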

common.py collects helper functions for training, and common_ds.py implements an iterable dataset that converts image data into trainable data for the DiT model.
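
Conceptually, such a dataset streams (image tensor, caption) pairs in the range the VAE expects; a minimal sketch of the idea (folder layout, caption source, and resolution are assumptions, not the actual common_ds.py):

import os
import numpy as np
import torch
from PIL import Image
from torch.utils.data import IterableDataset

class ImageCaptionStream(IterableDataset):
    """Streams (image_tensor, caption) pairs from a folder of .jpg files."""

    def __init__(self, root, size=512):
        self.root, self.size = root, size

    def __iter__(self):
        for name in sorted(os.listdir(self.root)):
            if not name.lower().endswith(".jpg"):
                continue
            img = Image.open(os.path.join(self.root, name)).convert("RGB")
            img = img.resize((self.size, self.size))
            # HWC uint8 -> CHW float32 in [-1, 1]
            x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float() / 127.5 - 1.0
            # Caption is derived from the filename here; a real dataset would
            # read it from a metadata file instead.
            yield x, os.path.splitext(name)[0].replace("_", " ")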

-Folders: The model folder saves the model's checkpoint and logs after training. The encoders folder saves other modules' checkpoints (e.g., VAE, CLIP).

⚠️ Warning: This repository still has experimental features and requires more testing.

  • Implementations of VAE, CLIP, and T5 Text Encoders
  • Implementation of Byte-Pair & Unigram tokenizers
  • Multi-Modal Diffusion Transformer Model
  • Flow-Matching Euler Scheduler
  • Logit-Normal Sampling (see the sketch after this list)
  • Joint Attention
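
A quick illustration of the logit-normal sampling above: during training, timesteps are drawn by passing a normal sample through a sigmoid, which concentrates t away from the extremes (the mean/std defaults below are the commonly used values, stated here as an assumption):

import torch

def sample_logit_normal_timesteps(batch_size, mean=0.0, std=1.0, device="cpu"):
    # u ~ N(mean, std), t = sigmoid(u) in (0, 1)
    u = torch.randn(batch_size, device=device) * std + mean
    return torch.sigmoid(u)

# e.g. t = sample_logit_normal_timesteps(8) gives timesteps for one training batch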

Get the repo

git clone "https://github.com/yousef-rafat/miniDiffusion"

Install Dependencies

pip install -r requirements.txt

Install Checkpoints for Models

  • Add a Hugging Face Token in get_checkpoints.py before running the script.
python3 encoders/get_checkpoints.py
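
As an illustration only, a script like that typically authenticates and pulls weights with huggingface_hub; a hedged sketch (the repo ID, filename, and token handling below are assumptions, not necessarily what get_checkpoints.py does):

from huggingface_hub import login, hf_hub_download

HF_TOKEN = "hf_..."  # paste your Hugging Face token here (placeholder)
login(token=HF_TOKEN)

ckpt = hf_hub_download(
    repo_id="stabilityai/stable-diffusion-3.5-medium",  # assumed gated repo id
    filename="sd3.5_medium.safetensors",                # assumed checkpoint name
)
print("downloaded to", ckpt)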

This project is under the MIT License and is made for educational and experimental purposes.



Comments

  • By liuliu 2025-06-14 15:32 · 1 reply

    If you are interested in this: Flux reference implementation is very minimalistic: https://github.com/black-forest-labs/flux/tree/main/src/flux

    The minRF project is very easy to start with training small diffusion models with rectified flow: https://github.com/cloneofsimo/minRF

    Also, the reference implementation of SD 3.5 is actually minimalistic too: https://github.com/Stability-AI/sd3-ref

    • By doctorpangloss 2025-06-14 17:52 · 3 replies

      Reference implementations are unmaintained and buggy.

      For example, https://github.com/huggingface/transformers/issues/27961: OpenAI's tokenizer for CLIP is buggy, it's a reference implementation, it isn't the one they used for training, and the problems with it go unsolved and get copied endlessly by other projects.

      What about Flux? They don't say its reference implementation was used for training (it wasn't), and it has bugs that break cudagraphs or similar, though those aren't that impactful. On the other hand, it uses the CLIP reference, and the CLIP reference is buggy, so this is buggy too...

      • By liuliu 2025-06-14 23:26 · 1 reply

        Congrats on finding a bug!

        However, the keyword here is training / inference divergence. Unfortunately, nobody is going to spend multiple millions to retrain a model, so our reimplementation needs to be bug-to-bug correct to use the trained weights properly. That's why the reference implementations are essential: they come from the original model trainers, so you have the best "bet" at matching the training code properly.

        To give you some concrete examples of bugs we need to maintain:

        1. In SDXL, they use OpenClipG for text encoding, but wrongly use 0 as the padding token (corresponding to the symbol "!"), whereas even in OpenClipG's own training the endoftext token was used as the padding token. However, if you switch SDXL to use the endoftext token as the padding token, you get subpar generated images due to the training / inference divergence.

        2. In FLUX, we mainly use T5 as the text encoder. However, T5 is usually used as an encoder with a mask over the padded positions, to avoid the padding tokens having an outsized impact. In FLUX, no mask is applied for T5 text encoding, which intuitively causes the padding tokens to take more effect than they should. Again, "fixing" this bug without retraining gives you subpar generated images.

        There are many examples like this; some are easier to fix than others (HiDream uses an ODE solver different from what we usually use for rectified flow, so you need to negate its prediction to be compatible with existing samplers, but that one is "easier to fix").
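
        To make example 2 concrete, here is a small sketch of the masked vs. unmasked behavior using the standard transformers T5 API (a smaller T5 checkpoint stands in for FLUX's actual encoder; this is an illustration, not FLUX's code):

        import torch
        from transformers import T5EncoderModel, T5TokenizerFast

        tok = T5TokenizerFast.from_pretrained("google/t5-v1_1-base")  # stand-in model
        enc = T5EncoderModel.from_pretrained("google/t5-v1_1-base")

        batch = tok(["a cat wearing a hat"], padding="max_length", max_length=77,
                    return_tensors="pt")

        with torch.no_grad():
            # Usual T5 usage: the attention mask keeps padding from influencing real tokens.
            masked = enc(input_ids=batch.input_ids,
                         attention_mask=batch.attention_mask).last_hidden_state
            # FLUX-style usage as described above: no mask, so the padding positions
            # participate in attention and the conditioning shifts.
            unmasked = enc(input_ids=batch.input_ids).last_hidden_state

        print((masked - unmasked).abs().max())  # nonzero: the two conditionings diverge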

        TL;DR: Yes, there are bugs in the software, but we are better off maintaining bug-to-bug compatibility than trying to "fix" them, which highlights the importance of a "done" reference implementation rather than the usual "active" implementation we expect elsewhere in the software industry.

        (I maintain the most complete reimplementation of SoTA media generation models in Swift: https://github.com/drawthingsai/draw-things-community/tree/m.... So I tend to think I know a thing or two about "reimplementation from scratch".)

        • By doctorpangloss 2025-06-15 1:26

          I think if you read the issue carefully you would understand that the CLIP implementation in transformers and as published by OpenAI is wrong and does not match their trained model code; and that doing the fix I suggest, empirically for me and in theory, improves results.

      • By 42lux 2025-06-14 18:04 · 1 reply

        You can disable CLIP L on Flux without a loss in quality. You are also making a mountain out of a molehill. CLIP is used everywhere.

        • By doctorpangloss 2025-06-14 22:48 · 1 reply

          Consider another interpretation: CLIP L in Flux can be disabled without a loss in quality because the way it is used is buggy!

          • By 42lux 2025-06-15 12:38 · 1 reply

            oh lord.

            • By doctorpangloss 2025-06-15 18:43 · 1 reply

              The truth is that the CLIP conditioning in Flux works well for Dreambooth style fine tuning where tokenization bugs can be acute, but not so severe as to cause the low impact of CLIP on their dev model. It is likely more impactful on their pro / max models but only BFL could say so.

              • By 42lux 2025-06-16 22:28 · 1 reply

                That's absolute nonsense.

                • By doctorpangloss 2025-06-18 17:12

                  Okay well, there are a few things that are known to be true: (1) CLIP's tokenizer in diffusers, in the reference source in BFL's repo, and in OpenAI's repo, is buggy; (2) many CLIP prompts are observed to have a low impact in the Flux dev and schnell models. It is very likely to be true that (1) the tokenizer in the BFL reference source and OpenAI's repo does not match the tokenizer used in training OpenAI's CLIP or the text conditioning for any of the Flux checkpoints; (2) the guidance and timestep distillation play a role in weakening the role of CLIP; (3) it is practical to fine tune CLIP on more image-caption pairs. If you care about fine tuning, the tokenization bugs matter. Everything else is hard to prove.

      • By electroglyph 2025-06-14 21:02 · 1 reply

        It shouldn't take a lot of effort to fix a tokenizer...

        • By doctorpangloss 2025-06-15 1:27

          People are a little too blinded by the insight porn of matching buggy behavior to just read and comprehend the issue. They can’t engage with the simpler and more pornographic insight porn that the reference implementations are buggy and do not match the trained artifacts.

  • By reedlaw 2025-06-14 15:00 · 1 reply

    I'm not sure what this means. If it means the Stable Diffusion 3.5 model, why is it fetching that here: https://github.com/yousef-rafat/miniDiffusion/blob/main/enco...

    The training dataset is very small, only including fashion-related pictures: https://github.com/yousef-rafat/miniDiffusion/tree/main/data...

    • By yousef_g 2025-06-14 15:06 · 2 replies

      The dataset is for trying out fine-tuning of the diffusion model. It's a reimplementation of SD3 by writing the code from scratch again, but the weights are taken from HuggingFace due to hardware constraints on my part.

      • By reedlaw 2025-06-14 15:41

        So this implements SD3 inference and fine-tuning?

      • By jatins 2025-06-15 3:19 · 3 replies

        > It's a reimplementation of SD3 by writing the code from scratch again, but the weights are taken from HuggingFace due to hardware constraints on my part.

        Could you clarify what you mean by this part -- if the weights are taken from HF then what's the implementation for?

        • By MoonGhost 2025-06-15 6:05

          My guess is that the weights from HF are used as the initial state for the model because full training is too expensive. Then the small dataset is used to train it further for a short time, which is fine-tuning. Together this shows that the model is 1) compatible and 2) trainable. In theory it can be trained from scratch on a big dataset. I didn't look at the code yet, so the questions are: 1) can it be trained in parallel? 2) what resources are required for training?

          Anyway, I may try to train it on limited specialized dataset...

        • By elbear 2025-06-15 11:01

          The model consists of its architecture, which is expressed as code, and its knowledge, which is gained through training.

        • By montebicyclelo 2025-06-15 7:44

          > if the weights are taken from HF then what's the implementation for

          The weights are essentially a bunch of floating point numbers (grouped into tensors). The code says what operations to do with the weights. E.g. say you load matrix W from the weights: you could do `y = W @ x`, or `y = W.T @ x`, or `y = W @ W @ x`, etc.
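
          In other words, the reimplementation supplies the architecture and forward pass, and the pretrained numbers are loaded into it; a tiny sketch (the module and checkpoint file names here are hypothetical):

          import torch
          import torch.nn as nn

          class TinyBlock(nn.Module):      # architecture = the code you write
              def __init__(self, dim=64):
                  super().__init__()
                  self.proj = nn.Linear(dim, dim)

              def forward(self, x):
                  return self.proj(x)      # "what operations to do with the weights"

          block = TinyBlock()
          state = torch.load("checkpoint.pt")  # weights = tensors someone else trained
          block.load_state_dict(state)         # from-scratch code + pretrained numbers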

  • By refulgentis 2025-06-14 18:34 · 2 replies

    I'm embarrassed to ask: can someone elaborate on, say, what we have now that we didn't have before the repo existed?

    I have studiously avoided making models, though I've been adjacent to their output for years now... I think the root of my confusion is I kinda assumed there were already PyTorch-based scripts for inference / training. (I assumed _at least_ inference scripts were released with models, and kinda figured fine-tuning / training ones were too.)

    So then I'm not sure if I'm just looking at a clean room / dirty room rewrite of those. Or maybe everyone is using "PyTorch" but it's usually calling into CUDA/C/some proprietary thingy that is much harder to grok than a pure PyTorch impl?

    Anyways, these aren't great guesses, so I'll stop myself here. :)

    • By _tqr3 2025-06-14 20:00 · 1 reply

      Stability AI, the creators of the Stable Diffusion models, release their products under their own Stability AI Community License, which is not "free" like the MIT license. You are not allowed to modify the weights in certain ways.

      This package is basically for running the model (inference) and maybe fine-tuning it using the existing weights. A great way to learn, but it could still run into the same licensing issue.

      • By refulgentis 2025-06-14 20:40 · 2 replies

        You can't finetune SD 3.5!?

        I thought the community license stuff was about keeping people from using it in prod and charging for it without Stability getting at least a small taste.

        This sucks.

        I haven't been keeping up with the gooner squad on Civit, but I did have some understanding that SD was less popular; I thought it was just because 3.5 came far too long after Flux with too little, if any, quality increase to be worth building new scaffolding for.

        • By fc417fc802 2025-06-15 3:35

          > You can't finetune SD 3.5!?

          They don't want you finetuning it in specific ways that might make them look bad by association.

        • By djhn 2025-06-16 5:55

          So, out of interest, what are good TLDR sources for following the gooner scene? Like some highlights newsletter, subreddit, podcast, youtube channel or something? I’m interested in keeping up with their methods, not their results and output.

    • By rockemsockem 2025-06-14 22:09 · 1 reply

      I believe this is the main piece

      > with minimal dependencies

      I haven't tried running SD 3.5 specifically, but it's built on Hugging Face libraries, which I personally always find to be a mess of dependencies that makes them really hard to set up without the exact configuration the original developers used (which is often not provided in enough detail to actually work). This makes it pretty hard to run certain models, especially a few months/years after the original release.

      For example, this appears to be the requirements file for the Stability AI reference implementation of SD3.5: there are no versions specified, and it includes "transformers", which is just an enormous library.

      https://github.com/Stability-AI/sd3.5/blob/main/requirements...

      • By refulgentis 2025-06-14 22:20

        Ah, tyvm, that maps well onto my knowledge set; I have an ONNX inference wrapper written in Dart. However, I have never been able to leverage the transformers.js ONNX demo code, i.e. have a reference to port to Dart.

        IIRC it is written in an abstraction layer that supports a transformers-like API surface. This also makes it opaque to figure out what you're actually passing to the model, adding a Python dep mess on top of that...woo boy.

HackerNews