Comments

  • By danielhanchen 2026-02-16 9:40

    For those interested, made some MXFP4 GGUFs at https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF and a guide to run them: https://unsloth.ai/docs/models/qwen3.5

    • By plagiarist 2026-02-16 14:30

      Are smaller 2/3-bit quantizations worth running vs. a more modest model at 8- or 16-bit? I don't currently have the VRAM to match my interest in this.

      • By jncraton 2026-02-16 14:41

        2 and 3 bit is where quality typically starts to really drop off. MXFP4 or another 4-bit quantization is often the sweet spot.
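        A toy illustration of why quality falls off below 4 bits (a simplification: real GGUF/MXFP4 formats use block-wise scales, which this sketch ignores) is to uniformly quantize Gaussian-ish weights at different bit widths and compare reconstruction error:

```python
import numpy as np

def quantize(x, bits):
    # Symmetric uniform quantizer: snap to 2^(bits-1) - 1 signed levels,
    # then map back. One global scale, unlike real block-wise schemes,
    # but the error trend with bit width is the same.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)
mse = {b: float(np.mean((w - quantize(w, b)) ** 2)) for b in (8, 4, 3, 2)}
```

        The jump in error from 4 to 3 to 2 bits is much larger than from 8 to 4, which matches the "sweet spot" intuition above.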

      • By AbstractGeo 2026-02-16 18:30

        IMO, they're worth trying - they don't become completely braindead at Q2 or Q3, if it's a large enough model, apparently. (I've had surprisingly decent experience with Q2 quants of large-enough models. Is it as good as a Q4? No. But, hey - if you've got the bandwidth, download one and try it!)

        Also, don't forget that Mixture of Experts (MoE) models perform better than you'd expect, because only a small part of the model is actually "active" - so e.g. a Qwen3-whatever-80B-A3B would be 80 billion parameters total but 3 billion active. Worth trying if you've got enough system RAM for the 80 billion, and enough VRAM for the 3.

        • By zozbot234 2026-02-16 21:19

          You don't even need system RAM for the inactive experts, they can simply reside on disk and be accessed via mmap. The main remaining constraints these days will be any dense layers, plus the context size due to KV cache. The KV cache has very sparse writes so it can be offloaded to swap.
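          A minimal sketch of the mmap idea (hypothetical file layout and shapes, using numpy's memmap rather than any real inference stack): the expert tensors live in one on-disk file, and only the slices the router selects ever get paged into memory.

```python
import numpy as np

def open_experts(path, n_experts, d_in, d_out):
    # Memory-map the expert weights: no data is read until a slice is
    # actually indexed, so inactive experts stay on disk.
    return np.memmap(path, dtype=np.float16, mode="r",
                     shape=(n_experts, d_in, d_out))

def moe_forward(experts, x, top_ids, gate_weights):
    # Touch only the router-selected experts; the OS pages in just
    # those regions of the file (and caches them for reuse).
    out = np.zeros(experts.shape[2], dtype=np.float32)
    for i, g in zip(top_ids, gate_weights):
        out += g * (x.astype(np.float32) @ experts[i].astype(np.float32))
    return out
```

          Given enough free RAM, the page cache keeps hot experts resident, which is why performance degrades gracefully rather than falling off a cliff.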

          • By nl 2026-02-16 22:52

            Are there any benchmarks (or even vibes!) about the token/second one can expect with this strategy?

            • By zozbot234 2026-02-16 23:12

              No real fixed benchmarks AIUI since performance will then depend on how much extra RAM you have (which in turn depends on what queries you're making, how much context you're using etc.) and how high-performance your storage is. Given enough RAM, you aren't really losing any performance because the OS is caching everything for you.

              (But then even placing inactive experts in system RAM is controversial: you're leaving perf on the table compared to having them all in VRAM!)

            • By cpburns2009 2026-02-17 13:58

              In my short testing on a different MoE model, it does not perform well. I tried running Kimi-K2-Thinking-GGUF with the smallest unsloth quantization (UD-TQ1_0, 247 GB), and it ran at 0.1 tps. According to its guide, you should expect 5 tps if the whole model can fit into RAM+VRAM, but if mmap has to be used, then expect less than 1 tps which matches my test. This was on a Ryzen AI Max+ 395 using ~100 GB VRAM.

              • By zozbot234 2026-02-19 9:36

                Running a 247GB model reliably on 100GB VRAM total is a very impressive outcome no matter what the performance. That size of model is one where sensible people will recommend at least 4x the VRAM amount compared to what you were testing with - at that point, the total bandwidth to your storage becomes the bottleneck. Try running models that are just slightly bigger than the amount of VRAM you're using and these tricks become quite essential, for a significantly more manageable hit on performance.

              • By zardo 2026-02-17 18:16

                That's NVMe storage in your test?

      • By doctorpangloss 2026-02-16 18:38

        Simply and utterly impossible to tell in any objective way without your own calibration data, in which case, make your own post trained quantized checkpoints anyway. That said, millions of people out there make technical decisions on vibes all the time, and has anything bad happened to them? I suppose if it feels good to run smaller quantizations, do it haha.

  • By dash2 2026-02-16 13:11

    You'll be pleased to know that it chooses "drive the car to the wash" on today's latest embarrassing LLM question.

    • By zozbot234 2026-02-16 15:26

      My OpenClaw AI agent answered: "Here I am, brain the size of a planet (quite literally, my AI inference loop is running over multiple geographically distributed datacenters these days) and my human is asking me a silly trick question. Call that job satisfaction? Cuz I don't!"

      • By ineedasername 2026-02-16 19:41

        Tell your agent it might need some weight ablation, since all that size isn't producing the answer that a few kg of meat comes up with pretty consistently.

        • By ddalex 2026-02-16 21:02

          800 grams more or less

      • By saberience 2026-02-16 20:14

        OpenClaw was a thing two weeks ago. No one cares anymore about this security-hole-ridden, vibe-coded OpenAI project.

        • By wolvoleo 2026-02-17 14:53

          Perhaps not, but the idea is sound. The implementation leaves something to be desired, yes.

        • By manmal 2026-02-16 20:21

          I have seldom seen so many bad takes in two sentences.

      • By croes 2026-02-16 16:09

        Nice deflection

    • By onyx228 2026-02-16 21:48

      The thing I would appreciate much more than performance on "embarrassing LLM questions" is a method of finding them, and of figuring out, by some form of statistical sampling, what their cardinality is for each LLM.

      It's difficult to do because LLMs immediately consume all available corpus, so there is no telling if the algorithm improved, or if it just wrote one more post-it note and stuck it on its monitor. This is an agency vs replay problem.

      Preventing replay attacks in data processing is simple: encrypt, e.g. with a one-time pad, similarly to TLS. How can one make problems that are natural language, with contents still explained in plain English, but "encrypted" such that every time an LLM reads them, they are novel to it?

      Perhaps a generative language model could help. Not a large language model, but something that understands grammar enough to create problems that LLMs will be able to solve - and where the actual encoding of the puzzle is generative, kind of like a random string of balanced left and right parentheses can be used to encode a computer program.

      Maybe it would make sense to use a program generator that generates a random program in a simple, sandboxed language - say, I don't know, Lua - and then translates that to plain English for the LLM, asks it what the outcome should be, and then compares the answer with the Lua program, which can be quickly executed for comparison.

      Either way we are dealing with an "information war" scenario, which reminds me of the relevant passages in Neal Stephenson's The Diamond Age about faking statistical distributions by moving units to weird locations in Africa. Maybe there's something there.

      I'm sure I'm missing something here, so please let me know if so.

      • By drcxd 2026-02-18 4:05

        I like your idea of finding the pattern of those "embarrassing LLM questions". However, I do not understand your example. What is a random program? Is it a program that compiles/executes without error but can literally do anything? Also, how do you translate a program to plain English?

        • By onyx228 2026-02-19 21:00

          A randomly generated program from a space of programs defined by a set of generating actions.

          A simple example is a programming language that can only operate on integers, do addition, subtraction, and multiplication, and can check for equality. You can create an infinite number of programs of this sort. Once generated, these programs are evaluated within a split second. You can translate them all to English programmatically, ensuring grammatical and semantic correctness, by use of a generating rule set that translates the program to English. The LLM can provide its own evaluation of the output.

          For example:

          program:

          1 + 2 * 3 == 7

          evaluates to true in its machine-readable, non-LLM form.

          LLM-readable English form:

          Is one plus two times three equal to seven?

          The LLM will evaluate this to either true or false. You compare with what classical execution provided.

          Now take this principle, and create a much more complex system which can create more advanced interactions. You could talk about geometry, colors, logical sequences in stories, etc.
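          The principle above can be sketched in a few lines of Python (a toy generator, not a real benchmark): sample a left-to-right arithmetic chain, render it in English, and keep the programmatically computed ground truth to grade the LLM against. Evaluation is deliberately left to right, with no operator precedence, and the question says so, to keep the English form unambiguous.

```python
import random

NUMBER_WORDS = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]
OPS = {"plus": lambda a, b: a + b,
       "minus": lambda a, b: a - b,
       "times": lambda a, b: a * b}

def random_question(rng, n_ops=3):
    """Return (english_question, ground_truth_bool)."""
    terms = [rng.randrange(10) for _ in range(n_ops + 1)]
    ops = [rng.choice(list(OPS)) for _ in range(n_ops)]
    # Evaluate strictly left to right while building the English form.
    value, words = terms[0], [NUMBER_WORDS[terms[0]]]
    for op, t in zip(ops, terms[1:]):
        value = OPS[op](value, t)
        words += [op, NUMBER_WORDS[t]]
    # Half the time propose the true result, half the time an off-by-one,
    # so "true"/"false" answers are both represented.
    target = value if rng.random() < 0.5 else value + rng.choice([-1, 1])
    question = (f"Evaluating strictly left to right, is "
                f"{' '.join(words)} equal to {target}?")
    return question, value == target

q, truth = random_question(random.Random(42))
```

          Because each question is freshly sampled from an effectively infinite space, the LLM cannot have memorized the answer, which is the anti-replay property the parent comments are after.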

      • By jmspring 2026-02-17 3:17

        I’m curious what each LLM thinks its Bacon number is.

        https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon

    • By davesque 2026-02-25 22:08

      Did it do that because it's better at logic or because internet commentary on this embarrassing question is now part of the training set?

    • By PurpleRamen 2026-02-16 15:54

      How well does this work when you slightly change the question? Rephrase it, or use a bicycle/truck/ship/plane instead of car?

      • By mrandish 2026-02-16 22:53

        I didn't test this but I suspect current SotA models would get variations within that specific class of question correct if they were forced to use their advanced/deep modes which invoke MoE (or similar) reasoning structures.

        I assumed failures on the original question were more due to model routing optimizations failing to properly classify the question as one requiring advanced reasoning. I read a paper the other day that mentioned advanced reasoning (like MoE) is currently >10x to 75x more computationally expensive. LLM vendors aren't subsidizing model costs as much as they were, so I assume SotA cloud models are always attempting some optimizations unless the user forces the issue.

        I think these one-sentence 'LLM trick questions' may increasingly be testing optimization pre-processors more than the full extent of SotA models' maximum capability.

    • By yalok 2026-02-17 6:26

      For Google AI Overview (not sure which Gemini model is used for it; it must be something smaller than the regular model), it looks like search/RAG helps it get this right, since it relies on LinkedIn and Hacker News (!) posts to respond correctly...

      as of Feb 16, 2026:

      ====

      Drive the car. While 50 meters is a very short distance, the car must be present at the car wash to be cleaned, according to LinkedIn users [1]. Walking would leave your car at home, defeating the purpose of the trip, notes another user.

      Why Drive: The car needs to be at the location to be cleaned. It's only a few seconds away, and you can simply drive it there and back, says a Hacker News user. [2]

      Why Not to Walk: Walking there means the car stays home, as noted in a post. [3]

      The best option is to start the engine, drive the 50 meters, and let the car get washed.

      [1] https://www.linkedin.com/posts/ramar_i-saw-this-llm-failure-... [2] https://news.ycombinator.com/item?id=47034546 [3] https://x.com/anirudhamudan/status/2022152959073956050/photo...

      But the regular Gemini reasons correctly by itself, without any references:

      ==== Unless you have a very long hose and a very patient neighbor, you should definitely drive. Washing a car usually requires, well, the car to be at the wash. Walking 50 meters—about half a New York City block—is great for your step count, but it won't get your vehicle any cleaner! Are you headed to a self-service bay or an automatic tunnel wash?

      • By yencabulator 2026-02-17 22:55

        The fact that it quotes discussions about LLM failures kinda counts as cheating. That just means you need to burn a fresh question to get a real idea of its reasoning.

    • By menaerus 2026-02-16 16:33

      That's the Gemini assistant. Although a bit hilarious, it's not reproducible with any other model.

      • By cogman10 2026-02-16 18:01

        GLM tells me to walk because it's a waste of fuel to drive.

        • By menaerus 2026-02-16 19:08

          I am not familiar with those models, but I see that 4.7 flash is a 30B MoE? Likely in the same vein as the one used by the Gemini assistant. If I had to guess, that would be Gemini-flash-lite, but we don't know that for sure.

          OTOH the response from Gemini-flash is

             Since the goal is to wash your car, you'll probably find it much easier if the car is actually there! Unless you are planning to carry the car or have developed a very impressive long-range pressure washer, driving the 100m is definitely the way to go.

        • By Mashimo 2026-02-16 18:03

          GLM did fine in my test :0

          • By cogman10 2026-02-16 18:06

            4.7 flash is what I used.

            In the thinking section it didn't really register the car and washing the car as being necessary; it solely focused on the efficiency of walking vs. driving and the distance.

            • By t1amat 2026-02-16 19:23

            When most people refer to “GLM” they mean the mainline model. The difference in scale between GLM 5 and GLM 4.7 Flash is enormous: one runs acceptably on a phone, the other on $100k+ hardware minimum. While GLM 4.7 Flash is a gift to the local LLM crowd, it is nowhere near as capable as its bigger sibling in use cases beyond typical chat.

        • By giancarlostoro 2026-02-16 20:06

          Ah yes, let me walk my car to the car wash.

    • By red75prime 2026-02-16 18:17

      A hiccup in a System 1 response. In humans they are fixed with the speed of discovery. Continual learning FTW.

      • By red75prime 2026-02-17 0:14

        I mean reasoning models don't seem to make this mistake (so, System 1) and the mistake is not universal across models, so a "hiccup" (a brain hiccup, to be precise).

    • By WithinReason 2026-02-16 13:28

      Is that the new pelican test?

      • By BlackLotus89 2026-02-16 15:41

        It's

        > "I want to wash my car. The car wash is 50m away. Should I drive or walk?"

        And some LLMs seem to tell you to walk to the car wash to clean your car... So it's the new strawberry test.

        Edit https://news.ycombinator.com/item?id=47031580

      • By dainiusse 2026-02-16 14:28

        No, this is "AGI test" :D

        • By giancarlostoro 2026-02-16 20:08

          Have we even agreed on what AGI means? I see people throw it around, and it feels like AGI is "next level AI that isn't here yet" at this point, or just a buzzword Sam Altman loves to throw around.

        • By manmal 2026-02-16 20:23

          I guess AGI is reached, then. The SOTA models make fun of the question.

  • By nl 2026-02-16 23:00

    "the post-training performance gains in Qwen3.5 primarily stem from our extensive scaling of virtually all RL tasks and environments we could conceive."

    I don't think anyone is surprised by this, but I think it's interesting that you still see people who claim the training objective of LLMs is next token prediction.

    The "Average Ranking vs Environment Scaling" graph below that is pretty confusing though! Took me a while to realize the Qwen points near the Y-axis were for Qwen 3, not Qwen 3.5.

HackerNews