BitNet: Inference framework for 1-bit LLMs



License: MIT

BitNet Model on Hugging Face

Try it out via this demo, or build and run it on your own CPU or GPU.

bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU (NPU support is coming next).

The first release of bitnet.cpp supports inference on CPUs. bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models seeing greater performance gains. It also reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% and 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU at speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. Please refer to the technical report for more details.
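As background, "1.58-bit" refers to ternary weights in {-1, 0, +1}: log2(3) ≈ 1.58 bits of information per weight. The following is an illustrative sketch of the absmean ternarization idea described in the BitNet b1.58 paper; the exact quantization granularity (per-tensor vs. per-group) used by bitnet.cpp's kernels may differ.

```python
import numpy as np

def absmean_ternarize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight tensor to ternary {-1, 0, +1} plus one scale.

    Sketch of the absmean scheme from the BitNet b1.58 paper; the
    granularity used by bitnet.cpp's real kernels may differ.
    """
    scale = np.abs(w).mean() + eps                       # gamma = mean(|W|)
    w_q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return w_q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
w_q, scale = absmean_ternarize(w)
assert set(np.unique(w_q).tolist()).issubset({-1, 0, 1})
# A matvec against (w_q * scale) needs only adds/subtracts plus one rescale.
```

Because the quantized weights are only -1, 0, or +1, multiplying by them reduces to additions and subtractions, which is what makes multiplication-free kernels possible.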

The latest optimizations introduce parallel kernel implementations with configurable tiling and embedding-quantization support, achieving an additional 1.15x to 2.1x speedup over the original implementation across different hardware platforms and workloads. For detailed technical information, see the optimization guide.

[Figure: performance comparison]

A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:

[Video: demo.mp4]

This project is based on the llama.cpp framework. We would like to thank all the authors for their contributions to the open-source community. Also, bitnet.cpp's kernels are built on top of the Lookup Table methodologies pioneered in T-MAC. For inference of general low-bit LLMs beyond ternary models, we recommend using T-MAC.
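To illustrate the Lookup Table idea: a group of g ternary weights can only take 3^g distinct patterns, so all possible partial dot products against the matching activation slice can be precomputed once and then looked up instead of multiplied. Below is a toy, unoptimized Python sketch of the concept; real T-MAC and bitnet.cpp kernels pack the codes into bits and vectorize the lookups.

```python
import numpy as np
from itertools import product

def lut_matvec(w_q: np.ndarray, x: np.ndarray, g: int = 2) -> np.ndarray:
    """Toy lookup-table mat-vec for ternary weights, in the spirit of T-MAC.

    For every group of g ternary weights, the 3**g possible partial dot
    products with the matching activation slice are precomputed once; each
    output row is then a sum of table lookups instead of multiplications.
    """
    rows, cols = w_q.shape
    assert cols % g == 0
    patterns = np.array(list(product((-1, 0, 1), repeat=g)))   # (3**g, g)
    out = np.zeros(rows, dtype=x.dtype)
    for j in range(0, cols, g):
        table = patterns @ x[j:j + g]                  # all 3**g partial sums
        # encode each weight group as a base-3 index into the table
        codes = (w_q[:, j:j + g] + 1) @ (3 ** np.arange(g - 1, -1, -1))
        out += table[codes]
    return out

w_q = np.array([[1, -1, 0, 1], [0, 0, 1, -1]])
x = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(lut_matvec(w_q, x), w_q @ x)
```

The table build costs O(3^g) per activation slice but is amortized over every output row that reuses it, which is why larger models benefit more.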

❗️We use existing 1-bit LLMs available on Hugging Face to demonstrate the inference capabilities of bitnet.cpp. We hope the release of bitnet.cpp will inspire the development of 1-bit LLMs in large-scale settings in terms of model size and training tokens.

  • python>=3.9
  • cmake>=3.22
  • clang>=18
    • For Windows users, install Visual Studio 2022. In the installer, enable at least the following options (this also automatically installs the required additional tools like CMake):

      • Desktop development with C++
      • C++ CMake tools for Windows
      • Git for Windows
      • C++ Clang Compiler for Windows
      • MS-Build Support for LLVM-Toolset (clang)
    • For Debian/Ubuntu users, you can install Clang with the automatic installation script:

      bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"

  • conda (highly recommended)

Important

If you are using Windows, please remember to always use a Developer Command Prompt / PowerShell for VS2022 for the following commands. Please refer to the FAQs below if you see any issues.

git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
# (Recommended) Create a new conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp

pip install -r requirements.txt
# Manually download the model and run with local path
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd]
                    [--use-pretuned]

Setup the environment for running inference

optional arguments:
  -h, --help            show this help message and exit
  --hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}
                        Model used for inference
  --model-dir MODEL_DIR, -md MODEL_DIR
                        Directory to save/load the model
  --log-dir LOG_DIR, -ld LOG_DIR
                        Directory to save the logging info
  --quant-type {i2_s,tl1}, -q {i2_s,tl1}
                        Quantization type
  --quant-embd          Quantize the embeddings to f16
  --use-pretuned, -p    Use the pretuned kernel parameters
# Run inference with the quantized model
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv
usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]

Run inference

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Path to model file
  -n N_PREDICT, --n-predict N_PREDICT
                        Number of tokens to predict when generating text
  -p PROMPT, --prompt PROMPT
                        Prompt to generate text from
  -t THREADS, --threads THREADS
                        Number of threads to use
  -c CTX_SIZE, --ctx-size CTX_SIZE
                        Size of the prompt context
  -temp TEMPERATURE, --temperature TEMPERATURE
                        Temperature, a hyperparameter that controls the randomness of the generated text
  -cnv, --conversation  Whether to enable chat mode or not (for instruct models.)
                        (When this option is turned on, the prompt specified by -p will be used as the system prompt.)
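On the i2_s quantization type used above: a ternary weight carries log2(3) ≈ 1.58 bits of information, but for aligned storage it is held in 2 bits, four weights per byte. The sketch below shows one plausible packing scheme to illustrate the idea; the actual i2_s byte layout in bitnet.cpp may order the fields differently.

```python
import numpy as np

def pack_ternary_2bit(w_q: np.ndarray) -> np.ndarray:
    """Pack ternary values {-1, 0, +1} into 2 bits each, four per byte.

    Illustrative only; the real i2_s layout may differ.
    """
    flat = (w_q.astype(np.int8).ravel() + 1).astype(np.uint8)  # map to {0,1,2}
    assert flat.size % 4 == 0
    quads = flat.reshape(-1, 4)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return (quads << shifts).sum(axis=1).astype(np.uint8)

def unpack_ternary_2bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_ternary_2bit."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    quads = (packed[:, None] >> shifts) & 0b11
    return quads.ravel().astype(np.int8) - 1

w = np.array([-1, 0, 1, 1, 0, 0, -1, 1], dtype=np.int8)
packed = pack_ternary_2bit(w)
assert packed.size == 2                                # 8 weights -> 2 bytes
assert np.array_equal(unpack_ternary_2bit(packed), w)
```

This 4x density over fp8 (and 8x over fp16) is the main source of the memory and bandwidth savings.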

We provide scripts to run the inference benchmark for a given model.

usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS]

Run the end-to-end inference benchmark

required arguments:
  -m MODEL, --model MODEL
                        Path to the model file.

optional arguments:
  -h, --help
                        Show this help message and exit.
  -n N_TOKEN, --n-token N_TOKEN
                        Number of generated tokens.
  -p N_PROMPT, --n-prompt N_PROMPT
                        Number of prompt tokens.
  -t THREADS, --threads THREADS
                        Number of threads to use.

Here's a brief explanation of each argument:

  • -m, --model: The path to the model file. This is a required argument that must be provided when running the script.
  • -n, --n-token: The number of tokens to generate during the inference. It is an optional argument with a default value of 128.
  • -p, --n-prompt: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512.
  • -t, --threads: The number of threads to use for running the inference. It is an optional argument with a default value of 2.
  • -h, --help: Show the help message and exit. Use this argument to display usage information.

For example:

python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4  

This command would run the inference benchmark using the model located at /path/to/model, generating 200 tokens from a 256 token prompt, utilizing 4 threads.

For model layouts not supported by any released model, we provide scripts to generate a dummy model with the given layout and run the benchmark on your machine:

python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M
# Run benchmark with the generated model; use -m to specify the model path, -p to specify the number of prompt tokens, -n to specify the number of tokens to generate
python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128
# Prepare the .safetensors model file
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16
# Convert to gguf model
python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16

A: This is an issue introduced in a recent version of llama.cpp. Please refer to the commit linked in the discussion to fix it.

A: Before building the project, verify your clang installation and access to Visual Studio tools by running:

clang -v

This command checks that you are using the correct version of clang and that the Visual Studio tools are available. If you see an error message such as:

'clang' is not recognized as an internal or external command, operable program or batch file.

It indicates that your command line window is not properly initialized for Visual Studio tools.

• If you are using Command Prompt, run:

"C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\VsDevCmd.bat" -startdir=none -arch=x64 -host_arch=x64

• If you are using Windows PowerShell, run the following commands:

Import-Module "C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\Microsoft.VisualStudio.DevShell.dll"
Enter-VsDevShell 3f0e31ad -SkipAutomaticLocation -DevCmdArguments "-arch=x64 -host_arch=x64"

These steps will initialize your environment and allow you to use the correct Visual Studio tools.



Comments

  • By giancarlostoro 2026-03-11 13:19 (9 replies)

    One of the things I often wonder is "what will be the minimally viable LLM" that can work from just enough information that if it googles the rest it can provide reasonable answers? I'm surprised something like Encyclopedia Britannica hasn't yet (afaik) tried to capitalize on AI by selling their data to LLMs and validating outputs for LLM companies; it would make a night-and-day difference in some areas, I would think. Wikipedia is nice, but there's so much room for human error and bias there.

    • By andai 2026-03-11 22:44

      Here's a short clip of Karpathy speaking on this subject.

      https://youtu.be/UldqWmyUap4

      Also this is the direction the small LLMs are moving in already. They are too small for general knowledge, but getting quite good at tool use (incl. Googling).

      Now we just need them to be very strict about what they know and don't know! (I think this is still an open problem, even with big ones.)

    • By intrasight 2026-03-11 13:29 (2 replies)

      It's not so much a "minimally viable LLM" but rather an LLM that knows natural language well but knows nothing else. Like me - as an engineer who knows how to troubleshoot in general but doesn't know about a specific device like my furnace (recent example).

      And I don't think that LLM could just Google or check Wikipedia.

      But I do agree that this architecture makes a lot of sense. I assume it will become the norm to use such edge LLMs.

      • By ramses0 2026-03-11 17:52

        I asked this question a while back (the "only train w/ wikipedia LLM") and got pointed to the general-purpose "compression benchmarks" page: `https://www.mattmahoney.net/dc/text.html`

        While I understand some of the fundamental thoughts behind that comparison, it's slightly wonky... I'm not asking "compress wikipedia really well", but instead "can a 'model' reason its way through wikipedia" (and what does that reasoning look like?).

        Theoretically with wikipedia-multi-lang you should be able to reasonably nail machine-translation, but if everyone is starting with "only wikipedia" then how well can they keep up with the wild-web-trained models on similar bar chart per task performance?

        If your particular training technique (using only wikipedia) can go from 60% of SOTA to 80% of SOTA on "Explain why 6-degrees of Kevin Bacon is relevant for tensor operations" (which is interesting to plug into Google's AI => Dive Deeper...), then that's a clue that it's not just throwing piles of data at the problem, but instead getting closer to extracting the deeper meaning (and/or reasoning!) that the data enables.

      • By giancarlostoro 2026-03-11 14:12 (1 reply)

        Correct! I know RAG is a thing, but I wish we could have "DLCs" for LLMs, like image generation has LoRAs, which are cheaper to train than retraining the entire model and provide more of the output you want. I would love to pop in the CS "LoRA or DLC" and ask it about functional programming in Elixir, or whatever.

        Maybe not crawl the web, but hit a service with pre-hosted, precurated content it can digest (and cache) that doesn't necessarily change often. You aren't using it for the latest news necessarily; programming, as a good example, is mostly static knowledge.

        • By dpflug 2026-03-12 22:50

          If I understand correctly, LoRA can be applied to LLMs

    • By embedding-shape 2026-03-11 13:23 (1 reply)

      Your worry about Wikipedia is that there is "much room for human error and bias", yet earlier you seem to imply that an LLM with access to the www would somehow have less human error and bias? Personally, I'd see it the other way around.

      • By giancarlostoro 2026-03-11 14:10

        When GPT 3.5 became a thing, it had crawled a very nuanced set of websites; this is what I mean. You basically curate where it sources data from.

    • By krychu 2026-03-12 10:12

      Unfortunately reasoning ability depends on (or is enabled by) information intake during training. A model will know better what to search for and how to interpret it if the information was part of the training. So there is a trade off. Still I think the question is a practical one. Perhaps there are ideas to focus training on a) reasoning / conceptual modeling and b) reliance on external memory (search etc.) rather than internal memorization.

    • By rablackburn 2026-03-12 5:06

      I feel like I should say "spoiler alert" but:

      > I often wonder is "what will be the minimally viable LLM" that can work from just enough information that if it googles the rest it can provide reasonable answers?

      It depends what that word "reasonable" means for your specific use-case ;)

    • By bee_rider 2026-03-11 13:57 (2 replies)

      Isn’t that sort of what RAG is? You’d need an LLM “smart” enough to turn natural-user prompts into searches, then some kind of search, then an LLM “smart” enough to summarize the results.

      • By giancarlostoro 2026-03-11 14:09

        Yeah, I think RAG is the idea that will lead us there, though it's a little complicated, because for some subjects, say Computer Science, you need a little more than just "This is Hello World in Go": you might need to understand not just Go syntax on the fly, but CS nuances that are not covered in one single simple document. The idea is having a model that runs fully locally on a phone or laptop with minimal resources. On the other hand, I can also see smaller models talking to larger models that are cheaper to run in the cloud. I wonder if this is the approach Apple might take with Siri, specifically in order to retain user privacy as much as possible.

      • By andai 2026-03-11 22:49 (1 reply)

        I remember reading that hallucination is still a problem even with perfect context. You build a theoretically perfect RAG, give the LLM the exact correct information, and it will still make mistakes surprisingly often.

        • By Natfan 2026-03-12 16:54

          this was my experience as of about 6 months ago, and i don't believe that hallucinating is a solved problem as of yet

    • By utopiah 2026-03-11 13:23

      > validating outputs for LLM companies

      How? They can validate thousands if not millions of queries, but nothing prevents the million-and-first from being a hallucination. People who pay extra for an "Encyclopedia Britannica validated LLM" would then, rightfully so IMHO, complain that "it" suggested they cook with a dangerous mushroom.

    • By uniq7 2026-03-11 13:27

      Since Google Search already includes an AI summary, your minimally viable "LLM" can be just an HTTP GET call.

    • By thinkingtoilet 2026-03-11 13:44 (1 reply)

      Wikipedia has proven to be as accurate as encyclopedias for decades now. Also, I'm betting AI companies have illegally trained their models on Encyclopedia Britannica's data by now.

      • By naasking 2026-03-11 16:39

        I think the idea is to train a small, minimal LLM thinking model that can run on edge devices, but that has very little knowledge embedded in its weights, and so performs a sort of RAG to Encyclopedia Britannica to ground answers to user queries.

  • By htk 2026-03-11 20:12 (1 reply)

    So Microsoft is actually using 2 bits instead of 1.58. In this case they could represent -1, 0, 1, 2. As inhibitory synapses account for 20%-30%, this could map well to how biological brains are structured.

    Does that make sense?

    • By hrimfaxi 2026-03-11 21:24 (2 replies)

      Can you explain your third statement?

      > As inhibitory synapses account for 20%-30%, this could map well to how biological brains are structured.

      • By DoctorOetker 2026-03-11 22:52 (1 reply)

        In the human brain most synapses are indeed excitatory, while a minority is inhibitory.

        No concise HN comment will give you a complete picture of what's currently known about the human brain, so a platitude necessarily follows:

        We call the nearly-touching interfaces between neurons synapses; small packets/droplets of neurotransmitter are sent across this interface from the source to the target neuron. Such signals can be excitatory (promoting the probability of the target firing soon) or inhibitory (inhibiting the probability of the target firing soon). There are two types of sensitive areas on your average neuron: the dendrites (long branching tentacles that receive excitatory signals), and the cell body, where all the signals are accumulated into a local instantaneous "sum". The cell body is also sensitive to synaptic activation, but the synapses on it are inhibitory: when sufficiently inhibited, the neuron will refuse to fire, so the inhibitory synapses on the cell body can gate the cumulative signal and temporarily prevent it from triggering this neuron. If the neuron does fire, the signal propagates along the axons (another type of branching tentacle), which lead to yet other neurons, sometimes touching them excitatorily at their dendrites, sometimes inhibitorily at their cell bodies.

        I hope that helped?

        • By vermilingua 2026-03-12 0:49

          It is really truly incredible that this mess of microscopic meat plumbing encodes everything we see, think, and do. Terrifying and amazing all at once.

  • By herf 2026-03-11 15:57 (1 reply)

    https://arxiv.org/pdf/2310.11453 — the original paper [fig 1, bottom-right] seems to say it needs about 4-5x the parameters of an fp16 model. You can build it and run some models, but the selection is limited because they have to be trained from scratch. I imagine inference speed is faster compared with modern PTQ (4- and 8-bit quants), though.

HackerNews