How to run Qwen 3.5 locally

2026-03-07 · unsloth.ai

Run the new Qwen3.5 LLMs including Medium: Qwen3.5-35B-A3B, 27B, 122B-A10B, Small: Qwen3.5-0.8B, 2B, 4B, 9B and 397B-A17B on your local device!

Qwen3.5 is Alibaba's new model family, including Qwen3.5-35B-A3B, 27B, 122B-A10B and 397B-A17B, plus the new Small series: Qwen3.5-0.8B, 2B, 4B and 9B. These multimodal hybrid reasoning LLMs deliver the strongest performance for their sizes. They support 256K context across 201 languages, offer both thinking and non-thinking modes, and excel in agentic coding, vision, chat, and long-context tasks. The 35B and 27B models run on a device with 22GB of RAM or unified memory (e.g. a Mac). See all GGUFs here.

All uploads use Unsloth Dynamic 2.0 for SOTA quantization performance - so 4-bit has important layers upcasted to 8 or 16-bit. Thank you Qwen for providing Unsloth with day zero access. You can also fine-tune Qwen3.5 with Unsloth.

Table: Inference hardware requirements (units = total memory: RAM + VRAM, or unified memory)

Between the 27B and 35B-A3B: use 27B if you want slightly more accurate results and it fits on your device. Go for 35B-A3B if you want much faster inference.

  • Maximum context window: 262,144 (can be extended to 1M via YaRN)

  • presence_penalty = 0.0 to 2.0 (default 0.0, i.e. off). Raising it reduces repetition, but higher values may cause a slight decrease in performance.

  • Adequate Output Length: 32,768 tokens for most queries
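As a sketch of what presence_penalty does in common inference engines (an illustration, not Qwen-specific code): every token that has already appeared in the output gets a fixed value subtracted from its logit, making repeats less likely.

```python
def apply_presence_penalty(logits, seen_token_ids, penalty=1.5):
    """Subtract `penalty` from the logit of every token id already generated."""
    return [logit - penalty if token_id in seen_token_ids else logit
            for token_id, logit in enumerate(logits)]

# Token 0 was already generated, so its logit drops from 2.0 to 0.5:
print(apply_presence_penalty([2.0, 1.0, 0.5], seen_token_ids={0}))  # [0.5, 1.0, 0.5]
```

This is why a large penalty can hurt quality: legitimate repeats (code identifiers, names) get suppressed too.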

If you're getting gibberish, your context length might be set too low. Passing --cache-type-k bf16 --cache-type-v bf16 may also help.

As Qwen3.5 is hybrid reasoning, thinking and non-thinking mode have different settings:

Thinking mode for general tasks:

  • repeat penalty = disabled or 1.0

Thinking mode for precise coding tasks (e.g. WebDev):

  • repeat penalty = disabled or 1.0

Instruct (non-thinking) mode for general tasks:

  • repeat penalty = disabled or 1.0

Instruct (non-thinking) mode for reasoning tasks:

  • repeat penalty = disabled or 1.0

Qwen3.5 Inference Tutorials:

Because Qwen3.5 comes in many different sizes, we'll be using Dynamic 4-bit MXFP4_MOE GGUF variants for all inference workloads. Click below to navigate to designated model instructions:

Qwen3.5-35B-A3B • 27B • 122B-A10B • 397B-A17B • Small (0.8B • 2B • 4B • 9B) • LM Studio

Unsloth Dynamic GGUF uploads:

For this guide we will be using the Dynamic 4-bit quant, which works great on a device with 24GB of RAM or unified memory (e.g. a Mac) for fast inference. Since the model is only around 72GB at full F16 precision, we won't need to worry much about performance. GGUF: Qwen3.5-35B-A3B-GGUF

For these tutorials, we will be using llama.cpp for fast local inference, which works well even if you only have a CPU.

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

If you want to use llama.cpp directly to load models, run the command below, where :Q4_K_M is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. The model has a maximum of 256K context length.

Follow one of the specific commands below, according to your use-case:

Thinking mode:

Precise coding tasks (e.g. WebDev):

General tasks:

Non-thinking mode:

General tasks:

Reasoning tasks:

Download the model (after running pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

Then run the model in conversation mode:

Qwen3.5 Small (0.8B • 2B • 4B • 9B)

For the Qwen3.5 Small series, because they're so small, all you need to do is change the model name in the scripts to the desired variant. For this specific guide we'll be using the 9B parameter variant. To run them all at near-full precision, you'll just need a device with 12GB of RAM / VRAM / unified memory. GGUFs:

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

If you want to use llama.cpp directly to load models, run the command below, where :Q4_K_XL is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. The model has a maximum of 256K context length.

Follow one of the specific commands below, according to your use-case:

Thinking mode (disabled by default)

General tasks:

Non-thinking mode is already on by default

General tasks:

Reasoning tasks:

Download the model (after running pip install huggingface_hub hf_transfer). You can choose Q4_K_M or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

Then run the model in conversation mode:

For this guide we will be using the Dynamic 4-bit quant, which works great on a device with 18GB of RAM or unified memory (e.g. a Mac) for fast inference. GGUF: Qwen3.5-27B-GGUF

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

If you want to use llama.cpp directly to load models, run the command below, where :Q4_K_M is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. The model has a maximum of 256K context length.

Follow one of the specific commands below, according to your use-case:

Thinking mode:

Precise coding tasks (e.g. WebDev):

General tasks:

Non-thinking mode:

General tasks:

Reasoning tasks:

Download the model (after running pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

Then run the model in conversation mode:

For this guide we will be using the Dynamic 4-bit quant, which works great on a device with 70GB of RAM or unified memory (e.g. a Mac) for fast inference. GGUF: Qwen3.5-122B-A10B-GGUF

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

If you want to use llama.cpp directly to load models, run the command below, where :Q4_K_M is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. The model has a maximum of 256K context length.

Follow one of the specific commands below, according to your use-case:

Thinking mode:

Precise coding tasks (e.g. WebDev):

General tasks:

Non-thinking mode:

General tasks:

Reasoning tasks:

Download the model (after running pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE (dynamic 4-bit) or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

Then run the model in conversation mode:

Qwen3.5-397B-A17B is in the same performance tier as Gemini 3 Pro, Claude Opus 4.5, and GPT-5.2. The full 397B checkpoint is ~807GB on disk, but via Unsloth's 397B GGUFs you can run:

  • 3-bit: fits on 192GB RAM systems (e.g., a 192GB Mac)

  • 4-bit (MXFP4): fits on 256GB RAM. Unsloth 4-bit dynamic UD-Q4_K_XL is ~214GB on disk - loads directly on a 256GB M3 Ultra

  • Runs on a single 24GB GPU + 256GB system RAM via MoE offloading, reaching 25+ tokens/s

  • 8-bit needs ~512GB RAM/VRAM
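These footprints follow from simple arithmetic: parameters × bits-per-weight ÷ 8. A quick sanity check (the 4.3 effective bits-per-weight figure is an assumption; dynamic 4-bit keeps some layers at 8 or 16-bit, so it sits a little above 4.0):

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough on-disk size of a checkpoint: parameters * bits per weight / 8, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# 397B parameters at 16-bit is ~794 GB, in line with the ~807 GB full checkpoint:
print(round(approx_size_gb(397e9, 16)))   # 794
# At ~4.3 effective bits (4-bit with some layers upcast), it lands near the ~214 GB quant:
print(round(approx_size_gb(397e9, 4.3)))  # 213
```

The same arithmetic explains the 3-bit and 8-bit figures above: roughly 3/16 and 8/16 of the full-precision size, plus overhead.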

See 397B quantization benchmarks on how Unsloth GGUFs perform.

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

If you want to use llama.cpp directly to load models, run the command below, where :Q4_K_M is the quantization type. You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save downloads to a specific location. Remember the model has a maximum context length of 256K.

Follow this for thinking mode:

Follow this for non-thinking mode:

Download the model (after running pip install huggingface_hub hf_transfer). You can choose MXFP4_MOE (dynamic 4-bit) or other quantized versions like UD-Q4_K_XL. We recommend using at least the 2-bit dynamic quant UD-Q2_K_XL to balance size and accuracy. If downloads get stuck, see: Hugging Face Hub, XET debugging

You can edit --threads 32 to set the number of CPU threads, --ctx-size 16384 for the context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, and remove it for CPU-only inference.
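As a rough way to pick --n-gpu-layers, you can estimate how many layers fit in VRAM. This is illustrative arithmetic only; the layer count and the headroom reserved for the KV cache are assumptions, so check your model's metadata and adjust empirically:

```python
def fit_gpu_layers(vram_gb: float, model_gb: float, n_layers: int,
                   reserve_gb: float = 4.0) -> int:
    """Estimate how many of the model's layers fit in VRAM,
    keeping some headroom for the KV cache and scratch buffers."""
    per_layer_gb = model_gb / n_layers
    return max(0, min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb)))

# e.g. a 24 GB GPU with a ~214 GB quant split over an assumed 94 layers:
print(fit_gpu_layers(24, 214, 94))  # 8
```

Start around the estimate and nudge --n-gpu-layers up or down until the model loads without out-of-memory errors.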

For this guide, we'll be using LM Studio, a unified UI interface for running LLMs. The '💡Thinking' and 'Non-thinking' toggle may not appear by default so we'll need some extra steps to get it working.

Download LM Studio for your device. Then open Model Search, search for 'unsloth/qwen3.5', and download the GGUF (quant) that you desire.

Thinking Toggle instructions: After downloading, open your Terminal / PowerShell and try lms --help. If the LM Studio CLI responds normally with its list of commands, run:

This fetches a YAML file that enables the '💡Thinking' and 'Non-thinking' toggle for your GGUF. You can change 4b to the quant you'd like to use.

Otherwise, you can go to our LM Studio page and download the specific yaml file.

Restart LM Studio, then load your downloaded model (with the specific thinking toggle you downloaded). You should now see the Thinking toggle enabled. Don't forget to set the correct parameters.

🦙 Llama-server serving & OpenAI's completion library

To deploy Qwen3.5-397B-A17B for production, we use llama-server. In a new terminal (say, via tmux), deploy the model via:

Then in a new terminal, after doing pip install openai, do:

🤔 How to enable or disable reasoning & thinking

For the below commands, you can use 'true' and 'false' interchangeably. To have Think toggle for LM Studio, read our guide.

To disable thinking / reasoning, use within llama-server:

If you're on Windows or Powershell, use: --chat-template-kwargs "{\"enable_thinking\":false}"

To enable thinking / reasoning, use within llama-server:

If you're on Windows or Powershell, use: --chat-template-kwargs "{\"enable_thinking\":true}"

As an example for Qwen3.5-9B to enable thinking (default is disabled):

And then in Python:
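A minimal stdlib-only sketch (the port and model name are assumptions; adjust them to your llama-server setup). It builds the request body you would POST to the server's OpenAI-compatible /v1/chat/completions endpoint, with chat_template_kwargs toggling thinking per request:

```python
import json

# Request body for llama-server's OpenAI-compatible endpoint
# (assumed here to be at http://localhost:8001/v1/chat/completions).
# "chat_template_kwargs" is what flips Qwen's thinking mode on or off per request.
payload = {
    "model": "qwen3.5-9b",  # llama-server serves one model; the name is informational
    "messages": [{"role": "user", "content": "Explain YaRN context extension briefly."}],
    "chat_template_kwargs": {"enable_thinking": True},
}
print(json.dumps(payload, indent=2))
```

With the official client (pip install openai), the same toggle goes through extra_body={"chat_template_kwargs": {"enable_thinking": True}} on client.chat.completions.create(...).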

👨‍💻 OpenAI Codex & Claude Code

To run the model in local agentic coding workloads, you can follow our guide. Just change the model name 'GLM-4.7-Flash' to your desired 'Qwen3.5' variant and ensure you follow the correct Qwen3.5 parameters and usage instructions. Use the llama-server we just set up.

After following the instructions for Claude Code for example you will see:

We can then ask, say, Create a Python game for Chess:

🔨Tool Calling with Qwen3.5

See Tool Calling Guide for more details on how to do tool calling. In a new terminal (if using tmux, use CTRL+B+D), we create some tools like adding 2 numbers, executing Python code, executing Linux functions and much more:

We then use the below functions (copy and paste and execute) which will parse the function calls automatically and call the OpenAI endpoint for any model:
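A minimal sketch of the parsing half of that workflow (the <tool_call> tag format is the one Qwen-family chat templates emit; the add tool here is a stand-in for your real functions, and dispatch happens locally rather than through the OpenAI endpoint):

```python
import json
import re

# Stand-in tool registry; register your real functions here.
TOOLS = {"add": lambda a, b: a + b}

def parse_tool_calls(text: str) -> list[dict]:
    """Extract <tool_call>{...}</tool_call> JSON blocks from a model reply."""
    return [json.loads(m) for m in
            re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)]

def run_tool_calls(text: str) -> list:
    """Dispatch every parsed call to the matching registered tool."""
    return [TOOLS[call["name"]](**call["arguments"])
            for call in parse_tool_calls(text)]

reply = 'Adding them now. <tool_call>{"name": "add", "arguments": {"a": 2, "b": 3}}</tool_call>'
print(run_tool_calls(reply))  # [5]
```

In a real loop you would append each tool result back to the conversation as a "tool" message and call the endpoint again so the model can use it.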

After launching Qwen3.5 via llama-server like in Qwen3.5 or see Tool Calling Guide for more details, we then can do some tool calls.

We updated the Qwen3.5-35B Unsloth Dynamic quants, which are now SOTA at nearly all bit-widths. We ran over 150 KL divergence benchmarks, totaling 9TB of GGUFs, and uploaded all research artifacts. We also fixed a tool-calling chat template bug (which affects all quant uploaders).

  • All GGUFs now updated with an improved quantization algorithm.

  • All use our new imatrix data. See some improvements in chat, coding, long context, and tool-calling use-cases.

  • Qwen3.5-35B-A3B GGUFs are updated with the new fixes (122B and 27B are still converting; re-download once they are updated)

  • 99.9% KL Divergence shows SOTA on Pareto Frontier for UD-Q4_K_XL, IQ3_XXS & more.

  • Retiring MXFP4 from all GGUF quants: Q2_K_XL, Q3_K_XL and Q4_K_XL, except for pure MXFP4_MOE.
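For reference, KL divergence measures how far the quantized model's next-token distribution drifts from the full-precision model's; zero means identical, lower is better. A minimal sketch with made-up distributions:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * ln(p_i / q_i).
    Lower means the quantized model tracks the full-precision one better."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

full_precision = [0.7, 0.2, 0.1]    # next-token probabilities from the original model
quantized      = [0.6, 0.25, 0.15]  # the quant's probabilities for the same tokens
print(kl_divergence(full_precision, full_precision))  # 0.0 (identical distributions)
print(kl_divergence(full_precision, quantized))       # small positive number
```

Benchmarks like the ones above average this quantity over many tokens of real text.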

35B-A3B - KLD benchmarks (lower is better)
122B-A10B - KLD benchmarks (lower is better)

READ OUR DETAILED QWEN3.5 ANALYSIS + BENCHMARKS HERE:

Qwen3.5 GGUF Benchmarks

Qwen3.5-397B-A17B Benchmarks

Benjamin Marie (third-party) benchmarked Qwen3.5-397B-A17B using Unsloth GGUFs on a 750-prompt mixed suite (LiveCodeBench v6, MMLU Pro, GPQA, Math500), reporting both overall accuracy and relative error increase (how much more often the quantized model makes mistakes vs. the original).

Key results (accuracy; change vs. original; relative error increase):

  • UD-Q4_K_XL: 80.5% (−0.8 points; +4.3% relative error increase)

  • UD-Q3_K_XL: 80.7% (−0.6 points; +3.5% relative error increase)
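The "relative error increase" metric can be sketched as the growth in error rate (100% minus accuracy) relative to the original model:

```python
def relative_error_increase(acc_quant: float, acc_orig: float) -> float:
    """How much more often the quantized model errs than the original, in percent."""
    err_quant, err_orig = 100.0 - acc_quant, 100.0 - acc_orig
    return (err_quant - err_orig) / err_orig * 100.0

# Illustration with made-up numbers: dropping from 90% to 89% accuracy takes the
# error rate from 10% to 11%, i.e. the quantized model errs 10% more often:
print(round(relative_error_increase(89.0, 90.0)))  # 10
```

This is why a sub-1-point accuracy drop can still read as a few percent relative error increase: the denominator is the (small) original error rate, not total accuracy.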

UD-Q4_K_XL and UD-Q3_K_XL stay extremely close to the original, well under a 1-point accuracy drop on this suite, which suggests you can sharply reduce the memory footprint (~500 GB less) with little to no practical loss on the tested tasks.

How to choose: Q3 scoring slightly higher than Q4 here is completely plausible as normal run-to-run variance at this scale, so treat Q3 and Q4 as effectively similar quality in this benchmark:

  • Pick Q3 if you want the smallest footprint / best memory savings

  • Pick Q4 if you want a slightly more conservative option with similar results

All listed quants use our dynamic methodology. Even UD-IQ2_M follows the same dynamic methodology, but its conversion process differs from UD-Q2_K_XL's: K_XL is usually faster than UD-IQ2_M even though it's bigger, which is why UD-IQ2_M may perform better than UD-Q2_K_XL.

Qwen3.5-35B-A3B, 27B and 122B-A10B Benchmarks

Qwen3.5-4B and 9B Benchmarks

Qwen3.5-397B-A17B Benchmarks

Comments

  • By moqizhengz 2026-03-08 6:03 · 8 replies

    Running 3.5 9B on my ASUS 5070ti 16G with LM Studio gives a stable ~100 tok/s. This outperforms the majority of online LLM services, and the actual quality of output matches the benchmark. This model is really something; first time ever having a usable model on consumer-grade hardware.

    • By smokel 2026-03-08 9:15 · 2 replies

      > This outperforms the majority of online llm services

      I assume you mean outperforms in speed on the same model, not in usability compared to other more capable models.

      (For those who are getting their hopes up on using local LLMs to be any replacement for Sonnet or Opus.)

      • By moffkalast 2026-03-08 10:04 · 4 replies

        Obviously it's not going to match a paid-tier 2T-sized SOTA model's quality, but it can probably roughly match Haiku at the very least. And for tasks that aren't super complex, that's already enough.

        Personally though, I find Qwen useless for anything but coding tasks because of its insufferable sycophancy. It's like 4o dialed up to 20; every reply starts with "You are absolutely right" with zero self-awareness. And for coding, only the best model available is usually sensible to use, otherwise it's just wasted time.

        • By Anduia 2026-03-08 10:13 · 6 replies

          That's why I start any prompt to Qwen 3.5 with:

          persona: brief rude senior

          • By amelius 2026-03-08 13:16 · 2 replies

            I'm using:

            persona: drunken sailor

            Because then at least the tone matches the quality of the output and I'm reminded of what I can expect.

          • By em500 2026-03-08 13:31 · 1 reply

            This also works

            persona: emotionless vulcan

            • By dlcarrier 2026-03-08 16:42

              Does "persona: air traffic controller" work?

              If I could set up a voice assistant that actually verifies commands, instead of assuming it heard everything correctly 100% of the time, it might even be useful.

          • By 9wzYQbTYsAIc 2026-03-08 15:47 · 1 reply

            persona: fair witness

            https://fairwitness.bot/

            • By Chris2048 2026-03-08 20:19 · 1 reply

              You just paste in that YAML? Is this an official llm config format that is parsed out?

              • By 9wzYQbTYsAIc 2026-03-10 17:31

                Yeah, just paste it in there - the LLM will figure it out. Play with it if you want to tweak the formatting - you could try JSON instead, but for readability I went with YAML.

          • By ranger_danger 2026-03-09 4:36

            wow I had no idea you could do that. this changes everything for me.

          • By varispeed 2026-03-08 14:59

            persona: party delegate in a rural province who doesn't want to be there

          • By lemonginger 2026-03-08 11:33

            gamechanger

        • By andai 2026-03-11 19:42 · 1 reply

          >for coding, only the best model available is usually sensible to use otherwise it's just wasted time.

          I had the opposite experience. Gave a small model and a big model the same 3 tasks. The small model was done in 30 sec. The large model took 90 sec, 3x longer, and cost 3x more. Depending on the task, the benchies just tell you how much you are over-paying and over-waiting.

          • By wehadit 2026-03-12 20:37

            If you use the models like we execute coding tasks, older models outperform latest models. There's this prep tax that happens even before we start coding, i.e., extract requirements from tools, context from code, comments and decisions from conversations, ACs from Jira/Notion, stitch them together, design tailored coding standards and then code. If you automate the prep tax, the generated code is close to production ready code and may require 1-2 iterations max. I gave it a try and compared the results and found the output to be 92% accurate while same done on Claude Code gave 68% accuracy. Prep tax is the cue here

        • By itsTyrion 2026-03-12 1:05

          oh? I used it in t3 chat before, with traits `concise` `avoid unnecessary flattery/affirmation/praise` `witty` `feel free to match potential user's sarcasm`

          and it does use that sarcasm permission at times (I still dislike the way it generally communicates)

        • By ggregoire 2026-03-08 14:11

          > I find Qwen useless for anything but coding tasks because if its insufferable sycophancy

          We use Qwen at work since 2.0 for text/image/video analysis (summarization, categorization, NER, etc), I think it's impressive. We ask for JSON and always ask "do not explain your response".

      • By segmondy 2026-03-08 16:56

        You can replace Sonnet and Opus with local models, you just need to run the larger ones.

    • By throwdbaaway 2026-03-08 7:02 · 4 replies

      There are Qwen3.5 27B quants in the range of 4 bits per weight, which fits into 16G of VRAM. The quality is comparable to Sonnet 4.0 from summer 2025. Inference speed is very good with ik_llama.cpp, and still decent with mainline llama.cpp.

      • By codemog 2026-03-08 7:54 · 5 replies

        Can someone explain how a 27B model (quantized no less) ever be comparable to a model like Sonnet 4.0 which is likely in the mid to high hundreds of billions of parameters?

        Is it really just more training data? I doubt it’s architecture improvements, or at the very least, I imagine any architecture improvements are marginal.

        • By alecco 2026-03-08 15:15

          AFAIK post-training and distillation techniques advanced a lot in the past couple of years. SOTA big models get new frontier and within 6 months it trickles down to open models with 10x less parameters.

          And mind that the source pre-training data was not made/written for training LLMs; it's just random stuff from the Internet, books, etc. So there's a LOT of completely useless and contradictory information. Better training texts are way better, and you can just generate & curate from those huge frontier LLMs. This was shown in the TinyStories paper, where GPT-4-generated children's stories could make models 3 orders of magnitude smaller achieve quite a lot.

          This is why the big US labs complain China is "stealing" their work by distilling their models. Chinese labs save many billions in training with just a bunch of accounts. (I'm just stating what they say, not giving my opinion).

        • By otabdeveloper4 2026-03-08 8:10 · 2 replies

          There's diminishing returns bigly when you increase parameter count.

          The sweet spot isn't in the "hundreds of billions" range, it's much lower than that.

          Anyways your perception of a model's "quality" is determined by careful post-training.

          • By codemog 2026-03-08 8:51 · 1 reply

            Interesting. I see papers where researchers will finetune models in the 7 to 12b range and even beat or be competitive with frontier models. I wish I knew how this was possible, or had more intuition on such things. If anyone has paper recommendations, I’d appreciate it.

            • By stavros 2026-03-08 10:08 · 1 reply

              They're using a revolutionary new method called "training on the test set".

              • By drob518 2026-03-08 18:34 · 1 reply

                So, curve fitting the training data? So, we should expect out of sample accuracy to be crap?

                • By stavros 2026-03-08 18:47

                  Yeah, that's usually what tends to happen with those tiny models that are amazing in benchmarks.

          • By zozbot234 2026-03-08 8:23

            More parameters improves general knowledge a lot, but you have to quantize more in order to fit in a given amount of memory, which if taken to extremes leads to erratic behavior. For casual chat use even Q2 models can be compelling, agentic use requires more regularization thus less quantized parameters and lowering the total amount to compensate.

        • By spwa4 2026-03-08 9:24

          The short answer is that there are more things that matter than parameter count, and we are probably nowhere near the most efficient way to make these models. Also: the big AI labs have shown a few times that internally they have way more capable models

        • By girvo 2026-03-08 9:39 · 2 replies

          Considering the full fat Qwen3.5-plus is good, but barely Sonnet 4 good in my testing (but incredibly cheap!) I doubt the quantised versions are somehow as good if not better in practice.

          • By rustyhancock 2026-03-08 10:31 · 1 reply

            I think it depends on work pattern.

            Many do not give Sonnet or even Opus full rein, where it really pushes ahead of other models.

            If you're asking for tightly constrained single functions at a time it really doesn't make a huge difference.

            I.e. the more you vibe, the better the model you need, especially over long-running tasks and large contexts. Claude is head and shoulders above everyone else in that setting.

            • By girvo 2026-03-08 10:44

              >I.e. the more vibe you do the better you need the model especially over long running and large contexts

              For sure, but the coolest thing about qwen3.5-plus is the 1mil context length on a $3 coding plan, super neat. But the model isn't really powerful enough to take real advantage of it I've found. Still super neat though!

          • By stavros 2026-03-08 10:07 · 1 reply

            When you say Sonnet 4, do you mean literally 4, or 4.6?

            • By girvo 2026-03-08 10:42 · 1 reply

              It's not as capable as Sonnet 4.6 in my usage over the past couple days, through a few different coding harnesses (including my own for-play one[0], that's been quite fun).

              [0] https://github.com/girvo/girvent/

              • By dr_kiszonka 2026-03-08 12:35 · 3 replies

                What is the benefit of writing your own harness? I am asking because I need to get better at using AI for programming. I have used Cursor, Gemini CLI, Antigravity quite a bit and have had a lot of difficulties getting them do what I want. They just tend to "know better."

                • By everforward 2026-03-08 15:35

                  I’m not an expert but I started with smaller tasks to get a feel for how to phrase things, what I need to include. It’s more manageable to manually fix things it screwed up than giving it full reign.

                  You may want to look at the AGENTS.md file too so you can include your stock style things if it’s repeatedly screwing up in the same way.

                • By girvo 2026-03-08 20:47

                  Purely as an exercise to see how they operate, and understand them better. Then additionally because I was curious how much better one could make something like qwen3.5-plus with its 1 mil context window despite its weaker base behaviour, if I was to give it something very focused on what I want from it

                  The Pi framework is probably right up your alley btw! Very extensible

                • By newswasboring 2026-03-08 13:28

                  I think it's the same instinct as making your own Game Engine. You start off either because you want to learn how they work or because you think your game is special and needs its own engine. Usually, it's a combination of both.

        • By revolvingthrow 2026-03-08 9:23 · 2 replies

          It doesn’t. I’m not sure it outperforms chatgpt 3

          • By BoredomIsFun 2026-03-08 10:22

            You are not being serious, are you? Even 1.5-year-old Mistral and Meta models outperform ChatGPT 3.

          • By gunalx 2026-03-08 10:15

            3 not 3.5? I think I would even prefer the qwen3.5 0.8b over GPT 3.

      • By zozbot234 2026-03-08 8:04 · 1 reply

        With MoE models, if the complete weights for inactive experts almost fit in RAM you can set up mmap use and they will be streamed from disk when needed. There's obviously a slowdown but it is quite gradual, and even less relevant if you use fast storage.

        • By htrp 2026-03-08 20:13

          any good packages you recommend for this?

      • By teaearlgraycold 2026-03-08 7:41 · 4 replies

        Qwen3.5 35B A3B is much much faster and fits if you get a 3 bit version. How fast are you getting 27B to run?

        On my M3 Air w/ 24GB of memory 27B is 2 tok/s but 35B A3B is 14-22 tok/s which is actually usable.

        • By throwdbaaway 2026-03-08 9:01 · 1 reply

          Using ik_llama.cpp to run a 27B 4bpw quant on a RTX 3090, I get 1312 tok/s PP and 40.7 tok/s TG at zero context, dropping to 1009 tok/s PP and 36.2 tok/s TG at 40960 context.

          35B A3B is faster but didn't do too well in my limited testing.

          • By ranger_danger 2026-03-09 5:03

            with regular llama.cpp on a 3070ti I get 60tok/s TG with the 9B model, it's quite impressive.

        • By ece 2026-03-08 8:03

          The 27B is rated slightly higher for SWE-bench.

        • By andai 2026-03-11 19:53

          27B needs less memory and does better on benchmarks, but 35B-A3B seems to run roughly twice as fast.

        • By ranger_danger 2026-03-09 4:58

          Don't sleep on the 9B version either, I get much faster speeds and can't tell any difference in quality. On my 3070ti I get ~60tok/s with it, and half that with the 35B-A3B.

      • By ljosifov 2026-03-08 11:38 · 1 reply

        Say more please if you can. How/why is ik_llama.cpp faster then mainline, for the 27B dense? I'd like to be able to run 27B dense faster on a 24GB vram gpu, and also on an M2 max.

        • By ac29 2026-03-08 15:37

          ik_llama.cpp was about 2x faster for CPU inference of Qwen3.5 versus mainline until yesterday. Mainline landed a PR that greatly increased speed for Qwen3.5, so now ik_llama.cpp is only 10% faster on token generation.

    • By the_duke 2026-03-08 10:46 · 2 replies

      What context length and related performance are you getting out if this setup?

      At least 100k context without huge degradation is important for coding tasks. Most "I'm running this locally" reports only cover testing with very small context.

      • By Aurornis 2026-03-08 16:58

        Long context degradation is a problem with the Qwen3.5 models for me. They have some clever tricks to accelerate attention that favor more recent context.

        The models can be frustrating to use if you expect long contexts to behave like they do on SOTA models. In my trials I could give them strict instructions to NOT do something and they would follow it for a short time before ignoring my prompt and doing the things I told it not to do.

      • By vardalab 2026-03-08 13:53

        Q4 quants on 32G VRAM give you 131K context for the 35B-A3B and 27B models, which are pretty capable. On a 5090 one gets 175 tg and ~7K pp with 35B-A3B; 27B is around 90 tg. So speed is awesome. Even Strix 395 gives 40 tk/s and 256K context. Pretty amazing, there is a reason people are excited about Qwen 3.5.

    • By lukan 2026-03-08 6:59 · 1 reply

      What exact model are you using?

      I have a 16GB GPU as well, but have never run a local model so far. According to the table in the article, 9B at 8-bit (13 GB) and 27B at 3-bit seem to fit inside the memory. Or is there more space required for context etc.?

      • By vasquez 2026-03-08 8:17

        It depends on the task, but you generally want some context. These models can do things like OCR and summarize a pdf for you, which takes a bit of working memory. Even more so for coding CLIs like opencode-ai, qwen code and mistral ai.

        Inference engines like llama.cpp will offload model and context to system RAM for you, at the cost of performance. A MoE like 35B-A3B might serve you better than the ones mentioned, even if it doesn't fit entirely on the GPU. I suggest testing all three. Perhaps even 122B-A10B if you have plenty of system RAM.

        Q4 is a common baseline for simple tasks on local models. I like to step up to Q5/Q6 for anything involving tool use on the smallish models I can run (9B and 35B-A3B).

        Larger models tolerate lower quants better than small ones; 27B might be usable at 3 bpw where 9B or 4B wouldn't. You can also quantize the context. On llama.cpp you'd set the flags -fa on, -ctk x and -ctv y; pass -h to see valid parameters. K is more sensitive to quantization than V, so don't bother lowering it past q8_0. KV quantization is allegedly broken for Qwen 3.5 right now, but I can't tell.

    • By yangikan 2026-03-08 6:22 · 5 replies

      Do you point claude code to this? The orchestration seems to be very important.

      • By tommyjepsen 2026-03-08 10:42

        I ran the Qwen3 Coder 30B through LM Studio and with OpenCode(Instead of Claude code). Did decent on M4 Max 32GB. https://www.tommyjepsen.com/blog/run-llm-locally-for-coding

      • By Aurornis 2026-03-08 16:36

        The 9B models are not useful for coding outside of very simple requests.

        Qwen3.5 is confusing a lot of newcomers because it is very confident in the answers it gives. It can also regurgitate solutions to common test requests like "make a flappy bird clone", which misleads users into thinking it's genuinely smart.

        Using the Qwen3.5 models for longer tasks and inspecting the output is a little more disappointing. They’re cool for something I can run locally but I don’t agree with all of the claims about being Sonnet-level quality (including previous Sonnet versions) in my experience with the larger models. The 9B model is not going to be close to Sonnet in any way.

      • By teaearlgraycold 2026-03-08 7:50

        I loaded Qwen into LM Studio and then ran Oh My Pi. It automatically picked up the LM Studio API server. For some reason the 35B A3B model had issues with Oh My Pi's ability to pass a thinking parameter which caused it to crash. 27B did not have that issue for me but it's much slower.

        Here's how I got the 35B model to work: https://gist.github.com/danthedaniel/c1542c65469fb1caafabe13...

        The 35B model is still pretty slow on my machine but it's cool to see it working.

      • By badgersnake 2026-03-08 9:13

        I’ve tried it with Claude Code and found it to be fairly crap. It got stuck in a loop doing the wrong thing and would not be talked out of it. It produced a bug that would stop the code compiling right after it had just compiled it, that sort of thing.

        Also seemed to ignore fairly simple instructions in CLAUDE.md about building and running tests.

      • By andsoitis 2026-03-08 13:44

        I use Claude Code for agentic coding, but in that case it is better to use qwen3-coder.

        qwen3-coder is better for code generation and editing, strong at multi-file agentic tasks, and purpose-built for coding workflows.

        In contrast, qwen3.5 is more capable at general reasoning, better at planning and architecture decisions, good balance of coding and thinking.

    • By jadbox 2026-03-08 13:09 · 3 replies

      Did you figure out how to fix Thinking mode? I had to turn it off completely as it went on forever, and I tried to fix it with different parameters without success.

      • By sammyteee 2026-03-08 16:02

        Thinking has definitely become a bit more convoluted in this model - I gave the prompt "hey" and it thought for about two minutes straight before giving a bog-standard "hello, how can I help" reply.

      • By andrekandre 2026-03-09 0:26

        supposedly you can turn it off by passing `\no_think` or `/no_think` into the prompt but it never worked for me

        what did work was passing / adding this json to the request body:

           { "chat_template_kwargs": {"enable_thinking": false}}
        
        [0] https://github.com/QwenLM/Qwen3/discussions/1300
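
        As a minimal sketch of such a request body (the model name is illustrative; the chat_template_kwargs shape is from the linked discussion):

```python
import json

# Body for an OpenAI-compatible /v1/chat/completions endpoint that
# honors chat_template_kwargs (e.g. vLLM/SGLang-style servers).
body = {
    "model": "qwen3.5-35b-a3b",  # illustrative model name
    "messages": [{"role": "user", "content": "hey"}],
    "chat_template_kwargs": {"enable_thinking": False},
}
print(json.dumps(body))
```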

      • By agile-gift0262 2026-03-09 7:55

        Did you try the recommended settings? The ones for thinking mode / general tasks really worked for me, especially the repetition_penalty. At first it wasn't working very well, and it was because I was using OpenWebUI's "Repeat Penalty" field, which didn't work. I needed to set a custom field with the exact name.

    • By bluerooibos 2026-03-08 18:01 · 2 replies

      These smaller models are fine for Q&A type stuff but are basically unusable for anything agentic like large file modifications, coding, or second-brain type stuff - they need so much handholding. I'd be interested to see a demo of what the larger versions can do on better hardware though.

      • By NorwegianDude 2026-03-08 19:56 · 1 reply

        Qwen3.5 27B works very well, to the point that if you spend money on Claude 4.5 Haiku, you could save hundreds of USD each day by running it yourself on a consumer GPU at home.

        • By bluerooibos 2026-03-09 23:37

          Compared to Opus 4.6 though? And what sort of hardware/RAM is that running on - I'm assuming 32 or 64GB at least, right?

      • By regularfry 2026-03-08 18:44

        In some ways the handholding is the point. The way I used qwen2.5-coder in the past was as a rubber duck that happens to be able to type. You have to be in the loop with it, it's just a different style of agent use to what you might do with copilot or Claude.

    • By y42 2026-03-08 19:41

      > consumer-grade hardware

      Not disagreeing per se, but a quick look at the installation instructions confirms what I assumed:

      Yeah, you can run a highly quantized version on your 2020 Nvidia GPU. But:

      - When inferencing, it occupies your whole machine. At least you have a modern interactive heating feature in your flat.

      - You need to follow eleven-thousand nerdy steps to get it running; my mum is really looking forward to that.

      - Not to mention the pain you went through installing Nvidia drivers; nothing my mum will ever manage in the near future.

      ... and all this to get something that merely competes with Haiku.

      Don't get me wrong - I am exaggerating, I know. It's important to have competition and the opportunity to run "AI" on your own metal. But this reminds me of the early days of smartphones and my old XDA Neo. Sure, it was damn smart, and I remember all those jealous faces because of my "device from the future." But oh boy, it was also a PITA maintaining it.

      Here we are now. Running AI locally is a sneak peek into the future. But as long as you need a CS degree and hardware worth a small car to achieve reasonable results, it's far from mainstream. Therefore, "consumer-grade hardware" sounds like a euphemism here.

      I like how we nerds are living in our bubble celebrating this stuff while 99% of mankind still doomscrolls through Facebook, laughing at (now AI-generated) brain rot.

      (No offense (ʘ‿ʘ)╯)

  • By mingodad 2026-03-08 8:47 · 4 replies

    I'm still a bit confused because it says "All uploads use Unsloth Dynamic 2.0", but then when looking at the available options, for 4 bits there is:

    IQ4_XS 5.17 GB, Q4_K_S 5.39 GB, IQ4_NL 5.37 GB, Q4_0 5.38 GB, Q4_1 5.84 GB, Q4_K_M 5.68 GB, UD-Q4_K_XL 5.97 GB

    And no explanation of what they are and what tradeoffs they have, but the tutorial explicitly used Q4_K_XL with llama.cpp.

    I'm using a Mac mini M4 16GB and so far my preferred model is Qwen3-4B-Instruct-2507-Q4_K_M, although it's a bit chatty; my test with Qwen3.5-4B-UD-Q4_K_XL shows it's a lot more chatty. I'm basically using it in chat mode for basic man-page style questions.

    I understand that each user has their own specific needs, but it would be nice to have a place that lists typical models/hardware with their common config parameters and memory usage.

    Even on specific Reddit channels it's a bit of a nightmare: lots of talk but no concrete, clear config/usage examples.

    I've been following this topic heavily for the last 3 months and I see more confusion than clarification.

    Right now I'm getting good cost/benefit results with the qwen CLI with the coder model in the cloud, and watching constantly to see when a local model on affordable hardware with environmentally friendly energy consumption arrives.

    • By danielhanchen 2026-03-08 11:54 · 1 reply

      Oh https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks might be helpful - it provides benchmarks for Q4_K_XL vs Q4_K_M etc for disk space vs KL Divergence (proxy for how close to the original full precision model)

      Q4_0 and Q4_1 were supposed to provide faster inference, but tests showed it reduced accuracy by quite a bit, so they are deprecated now.

      Q4_K_M and UD-Q4_K_XL use essentially the same scheme; _XL is just slightly bigger than _M

      The naming convention is _XL > _L > _M > _S > _XS

      • By sowbug 2026-03-08 20:22 · 1 reply

        Thanks for all your contributions.

        Do you think it's time for version numbers in filenames? Or at least a sha256sum of the merged files when they're big enough to require splitting?

        Even with gigabit fiber, it still takes a long time to download model files, and I usually merge split files and toss the parts when I'm done. So by the time I have a full model, I've often lost track of exactly when I downloaded it, so I can't tell whether I have the latest. For non-split models, I can compare the sha256sum on HF, but not for split ones I've already merged. That's why I think we could use version numbers.

        • By danielhanchen 2026-03-10 2:12 · 1 reply

          Thanks! Oh we do split if over 50GB - do you mean also split on 50GB shards? HuggingFace XET has an interesting feature where each file is divided into blocks, so it'll do a SHA256 on each block, and only update blocks
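
          Roughly this idea, as a sketch (XET's real chunking is content-defined; the fixed-size blocks here are just for illustration):

```python
import hashlib

def block_hashes(path, block_size=1 << 20):
    """Hash a file in blocks so an update only re-transfers changed blocks."""
    hashes = []
    with open(path, "rb") as f:
        while block := f.read(block_size):
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes
```

          Comparing two such lists shows which blocks differ, so an updated upload can skip everything unchanged.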

          • By sowbug 2026-03-10 3:03

            That might be the answer -- something like BitTorrent that updates only the parts that need updating.

            But I do think I'm identifying an unmet need. Qwen3.5-122B-A10B-BF16.gguf, for example: what's its sha256sum? I don't think the HF UI will tell you. I can only download the shards, verify each shard's sha256sum (which the HF UI does provide), llama-gguf-split --merge them, and then sha256sum the merged file myself. But I can't independently confirm that final sha256sum from any other source I trust.
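
            The manual dance described above, sketched with stand-in files (a real merge would use llama-gguf-split --merge rather than cat):

```shell
# Verify each shard against the sums HF shows, merge, then record the
# merged file's sum yourself, since HF doesn't publish it anywhere.
printf 'shard-1' > model-00001-of-00002.gguf   # stand-ins for real shards
printf 'shard-2' > model-00002-of-00002.gguf
sha256sum model-0000?-of-00002.gguf            # per-shard sums: HF shows these
cat model-0000?-of-00002.gguf > model.gguf     # stand-in for the real merge
sha256sum model.gguf > model.gguf.sha256       # the sum HF won't give you
```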

    • By PhilippGille 2026-03-08 10:23 · 1 reply

      > would be nice to have a place that have a list of typical models/ hardware listed with it's common config parameters and memory usage

      https://www.localscore.ai from Mozilla Builders was supposed to be this, but I guess there are not enough users - I didn't find any Qwen 3.5 entries yet

    • By ay 2026-03-08 9:56 · 1 reply

      I tried qwen3.5:4b in ollama on my 4-year-old Mac M1 with my own coding harness and it exhibited pretty decent tool calling, but it is a bit slow and seemed a little confused by the more complex tasks (also, I have it code Rust, which might add complexity). The task was “find the debug output that does X and make it conditional on whichever variable is controlled by the CLI ‘/debug foo’” - I didn’t do much with it after that.

      It may be interesting to try a 6bit quant of qwen3.5-35b-a3b - I had pretty good results with it running it on a single 4090 - for obvious reasons I didn’t try it on the old mac.

      I am using 8bit quant of qwen3.5-27b as more or less the main engine for the past ~week and am quite happy with it - but that requires more memory/gpu power.

      HTH.

      • By spwa4 2026-03-08 11:08 · 2 replies

        What matters for Qwen models, and most/all local MoE models (i.e. where performance is memory-limited), is memory bandwidth. This goes for small models too. Here's the top Apple chips by memory bandwidth (and to steal from clickbait: Apple definitely does not want you to think too closely about this):

        M3 Ultra — 819 GB/s

        M2 Ultra — 800 GB/s

        M1 Ultra — 800 GB/s

        M5 Max (40-core GPU) — 610 GB/s

        M4 Max (16-core CPU / 40-core GPU) — 546 GB/s

        M4 Max (14-core CPU / 32-core GPU) — 410 GB/s

        M2 Max — 400 GB/s

        M3 Max (16-core CPU / 40-core GPU) — 400 GB/s

        M1 Max — 400 GB/s

        Or, just counting portable/MacBook chips: M5 Max (top model, 64/128G), M4 Max (top model, 64/128G), M1 Max (64G). Everything else is slower for local LLM inference.

        TLDR: An M1 max chip is faster than all M5 chips, with the sole exception of the 40-GPU-core M5 max, the top model, only available in 64 and 128G versions. An M5 pro, any M5 pro (or any M* pro, or M3/M2 max chip) will be slower than an M1 max on LLM inference, and any Ultra chip, even the M1 Ultra, will be faster than any max chip, including the M5 max (though you may want the M2 ultra for bfloat16 support, maybe. It doesn't matter much for quantized models)
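
        A back-of-envelope way to turn those bandwidth numbers into decode speed (a crude upper bound: it ignores prompt processing, KV-cache traffic, and all overhead):

```python
# Decoding is roughly memory-bound: each generated token streams the
# active weights from memory once, so tok/s <= bandwidth / active bytes.
def est_tokens_per_sec(bandwidth_gb_s, active_params_b, bytes_per_param):
    active_gb = active_params_b * bytes_per_param
    return bandwidth_gb_s / active_gb

# M1 Max (400 GB/s) on a ~3B-active MoE at ~4.5 bits/param (~0.56 bytes):
print(f"{est_tokens_per_sec(400, 3.0, 0.56):.0f} tok/s upper bound")  # 238
```

        The point of the formula: doubling bandwidth doubles the ceiling, which is why the Ultra/Max chips dominate the list above.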

        • By embedding-shape 2026-03-08 17:28 · 1 reply

          For comparison, most recent (consumer) NVIDIA GPUs released:

          - 5050 - MSRP: 249 USD - 320 GB/s

          - 5060 - MSRP: 299 USD - 448 GB/s

          - 5060 Ti - MSRP: 379 USD - 448 GB/s

          - 5070 - MSRP: 549 USD - 672 GB/s

          - 5070 Ti - MSRP: 749 USD - 896 GB/s

          - 5080 - MSRP: 999 USD - 960 GB/s

          - 5090 - MSRP: 1999 USD - 1792 GB/s

          M3 Ultra seems to come close to a ~5070 Ti more or less.

          • By seanmcdirmid 2026-03-08 18:00 · 1 reply

            You should really list memory with the graphics cards, and above should list (unified) memory and prices as well with particular price points.

            • By embedding-shape 2026-03-08 19:13 · 1 reply

              I mean what I was curious (and maybe others) about was comparing it to parent's post, which is all about the memory bandwidth, hence the comparison.

              • By seanmcdirmid 2026-03-08 22:15 · 2 replies

                But it doesn't matter if you have 1000GB/s memory bandwidth if you only have 32GB of VRAM. Well, maybe for some applications it works out (image generation?), but it's not seriously competing with an Ultra with 128 GB of unified memory, or even a Max with 64 GB of unified memory.

                • By embedding-shape 2026-03-09 14:03 · 1 reply

                  > but its not seriously competing with an ultra with 128 GB of unified memory or even a max with 64 GB if unified memory.

                  No one is arguing that either, this sub-thread is quite literally about the memory bandwidth. Of course there are more things to care about in real-life applications of all this stuff, again, no one is claiming otherwise. My reply was adding additional context to the "What matters [...] is memory bandwidth" parent comment, nothing more, hence the added context of what other consumer hardware does in memory bandwidth.

                  • By seanmcdirmid 2026-03-09 22:19

                    If we are talking about Apple silicon, where we can configure the memory separately from the bandwidth (and the memory costs the same for each processor), we can say something like "it's all about bandwidth". If we switch to GPUs, where that is no longer true (NVIDIA won't let you buy a 5090 with more than 32GB of VRAM), then we aren't comparing apples to apples anymore.

                • By ranger_danger 2026-03-09 5:13 · 1 reply

                  A 10GB 3080 still beats even an M2 Ultra with 192GB... memory bandwidth is not the only factor.

                  https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...

                  • By seanmcdirmid 2026-03-09 7:07

                    If the model is small enough to fit in to 10GB of VRAM the GPU can win.

                    But the bigger models are more useful, so that’s what people fixate on.

        • By spatular 2026-03-08 19:13

          There is also prompt processing, which is compute-bound, and for agentic workflows it can matter more than token generation (tg), especially if the model is not of the "thinking" type.

    • By siquick 2026-03-08 11:16

      This may help you work out the best quant to use for your use case.

      https://www.siquick.com/blog/model-quantization-fine-tuning-...

  • By antirez 2026-03-08 9:24 · 2 replies

    My private benchmarks, using DeepSeek replies to coding problems as a baseline, with Claude Opus as judge. However, when reading these percentages, consider that the no-think setup is much faster and may be more practical for most situations.

        1   │ DeepSeek API -- 100%
        2   │ qwen3.5:35b-a3b-q8_0 (thinking) -- 92.5%
        3   │ qwen3.5:35b-a3b-q4_K_M (thinking) -- 90.0%
        4   │ qwen3.5:35b-a3b-q8_0 (no-think) -- 81.3%
        5   │ qwen3.5:27b-q8_0 (thinking) -- 75.3%
    
    I expected the 27B dense model to score higher. Disclaimer: those numbers are from one-shot reply evaluations; the model was not put in a context where it could iterate as an agent.

    • By throwdbaaway 2026-03-08 11:31 · 1 reply

      Yours is the only benchmark that puts 35B A3B above 27B. Time for human judgement to verify? For example, if you look at the thinking traces, there might be logical inconsistencies in the prompts, which then tripped up the 27B more when reasoning. This will also be reflected in the score when thinking is disabled, but we can sort of debug with the thinking traces.

      • By antirez 2026-03-08 12:00 · 1 reply

        I inspected manually and indeed the 27B is doing worse, but I believe it could be due to the exact GGUF in the ollama repository and/or with the need of adjusting the parameters. I'll try more stuff.

        • By andhuman 2026-03-08 12:55 · 1 reply

          Isn’t llama.cpp’s implementation of Qwen 3.5 better, or am I misinformed?

          • By antirez 2026-03-08 14:20

            There was a recent fix by ollama and I used it.

    • By alansaber 2026-03-08 14:07 · 1 reply

      Maybe a reductive question but are there any thinking models that don't (relatively) add much latency?

      • By ac29 2026-03-08 15:44

        The whole point of thinking is to throw more compute/tokens at a problem, so it will always add latency over non thinking modes/models. Many models do support variable thinking levels or thinking token budgets though, so you can set them to low/minimal thinking if you want only a minimal increase in latency versus no thinking.

HackerNews