As I mentioned yesterday, I recently needed to process hundreds of low-quality images of invoices (for a construction project). I had a script that used PIL/OpenCV, pytesseract, and OpenAI as a fallback. It still had a staggering number of failures.
Today I tried a handful of the really poor-quality invoices and Qwen spat out all the information I needed without an issue. What's crazier is that it gave me bounding boxes I could use to improve tesseract.
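For reference, the old script was roughly along these lines (heavily simplified; the model name, fallback prompt, and confidence cutoff are placeholders rather than my exact values):

    import base64
    import pytesseract
    from PIL import Image, ImageOps
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def vlm_fallback(path):
        # Last resort: send the raw image to a hosted vision model
        b64 = base64.b64encode(open(path, "rb").read()).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": [
                {"type": "text", "text": "Extract the invoice number, date, and total from this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ]}],
        )
        return resp.choices[0].message.content

    def ocr_invoice(path):
        # Grayscale + autocontrast, then tesseract; fall back when confidence is poor
        img = ImageOps.autocontrast(Image.open(path).convert("L"))
        data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
        words = [w for w, c in zip(data["text"], data["conf"])
                 if w.strip() and float(c) > 60]
        return " ".join(words) if len(words) > 10 else vlm_fallback(path)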
I would recommend taking a look at this service: https://learn.microsoft.com/en-us/rest/api/computervision/re...
Microsoft Vision is slow, has a ridiculous rate limit, and isn't any better than what you can run yourself. Every request has to go over HTTP (subject to that rate limit), and there is no way to submit bulk jobs. It's also incredibly expensive.
I wonder why you chose Qwen specifically - Mistral has a specialized model just for OCR that they advertised heavily (I tested it and it works surprisingly well, at least on English-language books from the '80s and '90s).
Mistral's model was terrible when I tested it on non-Latin characters and on anything that isn't neatly printed text (e.g. handwriting).
I like to test these models on reading the contents of '80s Apple ][ game screenshots. These are very low resolution and very dense. All (free-to-use) models struggle at that task...
My dataset could be described in a similar way. Very low quality, very odd layouts, information density where it's completely unnecessary.
And these contractors were relatively good operators compared to most.
Interesting. I have in the past tried to get VLMs to estimate bounding boxes for property boundaries on satellite maps, but had no success. Do you have any tips on how to improve the results?
With Qwen I went as stupid as I could: "please provide the bounding box metadata for pytesseract for the above image."
And it spat it out.
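If it helps anyone, here's roughly how I fed the boxes back into tesseract (a simplified sketch; the filename and the JSON below are made-up examples, and the exact keys you get back depend on how you phrase the prompt):

    import json
    import pytesseract
    from PIL import Image

    img = Image.open("invoice_047.jpg")  # placeholder filename

    # Illustrative example of the kind of JSON the model can return
    # (key names and coordinates here are made up)
    boxes = json.loads("""[
      {"label": "invoice_number", "bbox_2d": [812, 54, 1010, 92]},
      {"label": "total", "bbox_2d": [790, 1403, 1012, 1447]}
    ]""")

    for b in boxes:
        x1, y1, x2, y2 = b["bbox_2d"]
        # If the coordinates look too small for your image, the model was likely working
        # on a resized copy -- rescale by (original size / model input size) first.
        crop = img.crop((x1, y1, x2, y2))
        text = pytesseract.image_to_string(crop, config="--psm 7")  # treat crop as one text line
        print(b["label"], "->", text.strip())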
It’s funny that many of us say please. I don’t think it impacts the output, but it also feels wrong without it sometimes.
Depends on the model, but e.g. [1] found that many models perform better if you are more polite. Though, interestingly, being rude can also sometimes improve performance at the cost of higher bias.
Intuitively it makes sense. The best sources tend to be either moderately polite (professional language) or 4chan-like (rude and biased, but honest).
When I want an LLM to be brief, I will say things like "be brief", "don't ramble", etc.
When that fails, "shut the fuck up" always seems to do the trick.
I ripped into cursor today. It didn't change anything but I felt better lmao
Before GPT-5 was released, I already had the feeling that the web UI responses were declining, so I started trying to get more out of them. Dissing it and saying how useless its response was did actually seem to improve the output (I think).
The way I think of it, talking to an LLM is a bit like talking to myself or listening to an echo, since what I get back depends only on what I put in. If it senses that I'm frustrated, it will be inclined to make even more stuff up in an attempt to appease me, so that gets me nowhere.
I've found it more useful to keep it polite and "professional" and restart the conversation if we've begun going around in circles.
And besides, if I make a habit of behaving badly with LLMs, there's a good chance that I'll do it without thinking at some point and get in trouble.
It's a good habit to build now in case AGI actually happens out of the blue.
Gemini has purpose-built post-training for bounding boxes, if you haven't tried it.
The latest update to Gemini Live does real-time bounding boxes on the objects it's talking about; it's pretty neat.
shameless plug here for AMD's AI Dev Day - registration is open and they want feedback on what to focus on: https://www.amd.com/en/corporate/events/amd-ai-dev-day.html
Also, a documented stack setup, if you could.
I've tried that too, trying to detect the scan layout to get better OCR, but it didn't really beat a fine-tuned Qwen2.5-VL 7B. I'd say fine-tuning is the way to go.
What's the cost of the fine-tuned model? If you were attempting to optimize for cost, would it be worth it to detect scan layouts to get better OCR?
Honestly, I'm such a noob in this space. I had 1 project I needed to do, didn't want to do it by hand which would have taken 2 days so I spent 5 trying to get a script to do it for me.
The model runs on an H200 in ~20s, costing about $2.40/hr. On an L4 it's cheaper at ~$0.30/hr but takes ~85s to finish. Overall, the H200 ends up cheaper at volume. My scans have a separate issue though: each page has two columns, so text from the right column sometimes bleeds into the left. OCR can't really tell where sentences start and end unless the layout is split by column.
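To make the column problem concrete, the kind of split I mean looks roughly like this (a simplified sketch; the path is a placeholder and it assumes the gutter sits somewhere near the middle of the page):

    import cv2
    import numpy as np
    import pytesseract

    page = cv2.imread("page_012.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
    _, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Find the column gutter: the vertical band with the least ink near the page centre
    ink_per_column = binary.sum(axis=0)
    mid = page.shape[1] // 2
    window = ink_per_column[mid // 2 : mid + mid // 2]
    gutter = mid // 2 + int(np.argmin(window))

    # OCR each column separately so sentences don't bleed across the gutter
    left, right = page[:, :gutter], page[:, gutter:]
    text = pytesseract.image_to_string(left) + "\n" + pytesseract.image_to_string(right)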
Just unsloth on Colab using an A100, with the dataset on Google Drive.
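Roughly this, from memory (the checkpoint name and hyperparameters below are illustrative, not my exact settings; check unsloth's own vision fine-tuning notebooks for the current API):

    from unsloth import FastVisionModel

    model, tokenizer = FastVisionModel.from_pretrained(
        "unsloth/Qwen2.5-VL-7B-Instruct",  # assumed checkpoint name
        load_in_4bit=True,                 # fits comfortably on the A100
    )
    model = FastVisionModel.get_peft_model(
        model,
        finetune_vision_layers=True,       # the scans are the hard part, so tune vision too
        finetune_language_layers=True,
        r=16,
        lora_alpha=16,
    )
    FastVisionModel.for_training(model)
    # From here it's trl's SFTTrainer with unsloth's vision data collator, pointed at
    # the image/label pairs mounted from Google Drive.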
So where did you load up Qwen, and how did you supply the PDF or photo files? I don't know how to use these models, but I want to learn.
LM Studio[0] is the best "i'm new here and what is this!?" tool for dipping your toes in the water.
If the model supports "vision" or "sound", that tool makes it relatively painless to take your input file + text and feed it to the model.
[0]: https://lmstudio.ai/
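Once a model is loaded, LM Studio can also expose a local OpenAI-compatible server (default port 1234), so you can script the same thing. Roughly (model name and file path are placeholders):

    import base64
    from openai import OpenAI

    # LM Studio's local server speaks the OpenAI chat API
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    b64 = base64.b64encode(open("invoice_scan.jpg", "rb").read()).decode()
    resp = client.chat.completions.create(
        model="qwen2.5-vl-7b-instruct",  # whatever name LM Studio shows for the loaded model
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Read this invoice and return the number, date, and total."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
    )
    print(resp.choices[0].message.content)

For PDFs, I'd convert the pages to images first (e.g. with pdf2image) and send them one at a time.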
Jumping on this for visibility: LM Studio really is the best option out there. Ollama is another runtime I've used, but I've found it makes too many assumptions about what a computer is capable of, and it's almost impossible to override those settings. It often overloads weaker computers and underestimates stronger ones.
LM Studio isn't as "set it and forget it" as Ollama is, and it does have a bit of a learning curve. But if you're doing any kind of AI development and you don't want to mess around with writing llama-cpp scripts all the time, it really can't be beat (for now).
Thank you! I will give it a try and see if I can get that 4090 working a bit.
You can use their models at chat.qwenlm.ai; it's their official website.
I wouldn't recommend using anything that can transmit data back to the CCP. The model itself is fine since it's open source (and you can run it firewalled if you're really paranoid), but directly using Alibaba's AI chat website should be discouraged.
AnythingLLM is also good for that GUI experience!
I should add that sometimes LM Studio just feels better for the use case: same model, same purpose, seemingly different output, usually when RAG is involved. But AnythingLLM is definitely a very intuitive visual experience.
Any tips on getting bounding boxes? The model doesn't seem to even understand the original size of the image. And even if I provide the dimensions, the positioning is off. :'(
Wait a moment... It gave you BOUNDING BOXES? That is awesome! That is a missing link I need for models.
I had success with tabula; you may not need AI. But fine if it works too.
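Minimal usage, if you haven't tried it (tabula-py needs a Java runtime, and it only works on PDFs with a real text layer, not scans; the path is a placeholder):

    # pip install tabula-py
    import tabula

    tables = tabula.read_pdf("invoice.pdf", pages="all")  # returns a list of pandas DataFrames
    print(tables[0])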
People actually use tesseract? It's one of the worst OCR solutions out there. Forget it.
I would strongly emphasize:
CV != AI Vision
gpt-4o would breeze through your poor images.
It did not, unfortunately. When CV failed, gpt-4o failed as well. I even had a list of valid invoice numbers & dates to help the models. Still, most failed.
Construction invoices are not great.
Did you try few-shotting examples when you hit problem cases? In my ziploc case, the model was failing if red sharpie was used vs black. A few-shot hint fixed that.
Tbh, I had run the images through a few filters. The images that went through to the AI were high-contrast black and white, with noise such as highlighter removed. I had tried one-shot and few-shot.
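The filter pass was along these lines (simplified; the thresholds and path here are made up):

    import cv2

    img = cv2.imread("invoice_raw.jpg")  # placeholder path
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    # Push strongly saturated pixels (highlighter, coloured stamps) to white
    saturated = cv2.inRange(hsv, (0, 80, 80), (179, 255, 255))
    img[saturated > 0] = (255, 255, 255)

    # Then grayscale + Otsu for a high-contrast black-and-white image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    cv2.imwrite("invoice_clean.png", bw)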
I think it was largely a formatting issue. Some of these invoices have nonsense layouts. Perhaps Qwen works well because it doesn't assume left-to-right, top-to-bottom? Just speculating, though.
I'm very surprised. I've dealt with some really ugly inputs (handwritten text on full Ziploc bags, stained and torn handwritten recipe cards, etc.) with super good success.
The Chinese are doing what they have been doing to the manufacturing industry as well: take the core technology and just optimize, optimize, optimize for 10x the cost efficiency. As simple as that. Super impressive. These models might be benchmaxxed, but as another comment said, I see so many that it might as well be the most impressive benchmaxxing today, if not just a genuinely SOTA open-source model. They even released a closed-source 1-trillion-parameter model today as well that is sitting at no. 3 (!) on LM Arena. Even their 80B model is 17th; gpt-oss 120b is 52nd. https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2...
They still suck at explaining which model they serve is which, though.
They also released Qwen3-VL Plus [1] today, alongside Qwen3-VL 235B [2], and they don't tell us which one is better. Note that Qwen3-VL-Plus is a very different model from Qwen-VL-Plus.
Also, qwen-plus-2025-09-11 [3] vs qwen3-235b-a22b-instruct-2507 [4]. What's the difference? Which one is better? Who knows.
You know it's bad when OpenAI has a clearer naming scheme.
[1] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[2] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[3] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[4] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
> They still suck at explaining which model they serve is which, though.
"they" in this sentence probably applies to all "AI" companies.
Even the naming/versioning of OpenAI models is ridiculous, and then you can never find out which is actually better for your needs. Every AI company writes several paragraphs of fluffy text with lots of hand waving, saying how this model is better for complex tasks while this other one is better for difficult tasks.
Both DeepSeek and Claude are exceptions: simple version numbers, and Sonnet is overall worse but faster than Opus for the same version.
Eh, I mean, often innovation comes just from letting a lot of fragmented, small teams of cracked nerds try stuff out. It's way too early in the game. I mean, Qwen's release statements have anime in them, etc. IBM, Bell, Google, Dell: many did it similarly, letting small, focused teams take many attempts at cracking the same problem. All modern quant firms do basically the same. Anthropic is actually an exception, more like Apple.
It's sometimes not really a matter of which one is better but which one fits best.
For example, many have switched to Qwen3 models, but some still vastly prefer the reasoning and output of QwQ (a Qwen2.5-based model).
And the difference between them: those with "Plus" are closed-weight; you can only access them through their API. The others are open-weight, so if they fit your use case, and if the want or need ever arises, you can download them, use them, and even fine-tune them locally, even if Qwen stops offering access to them.
If the naming is so clear to you, then why don't you explain: for a user who wants to use Qwen3-VL through an API, which one has better performance? Qwen3-VL Plus or Qwen3-VL 235b?
My previous post should have answered this question. But since it didn't, I think I'm ill-equipped to answer you in a satisfactory fashion; I would just be repeating myself.
> Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency. As simple as that. Super impressive.
This "just" is incorrect.
The Qwen team invented things like DeepStack https://arxiv.org/abs/2406.04334
(Also, I hate this "The Chinese" thing. Do we say "The British" if it came from a DeepMind team in the UK? Or what if there are Chinese-born US citizens working in Paris for Mistral?
Give credit to the Qwen team rather than a whole country. China has both great labs and mediocre labs, just like the rest of the world.)
The naming makes some sense here. It's backed by the very Chinese Alibaba and the government directly as well. It's almost a national project.
The Americans do that all the time. :P
> Do we say "The British"
Yes.
Yeah it's just weird Orientalism all over again
> Also I hate this "The Chinese" thing
to me it was positive assessment, I adore their craftsmanship and persistence in moving forward for long period of time.
It erases the individuals doing the actual research by viewing Chinese people as a monolith.
Interestingly, I've found that models like Kimi K2 spit out more organic, natural-sounding text than American models.
It fails on the benchmarks compared to other SOTA models, but the real-world experience is different.
> Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency.
This is what really grinds my gears about American AI and American technology in general lately, as an American myself. We used to do that! But over the last 10-15 years, it seems like all this country can do is try to throw more and more resources at something instead of optimizing what we already have.
Download more ram for this progressive web app.
Buy a Threadripper CPU to run this game that looks worse than the ones you played on the Nintendo Gamecube in the early 2000s.
Generate more electricity (hello Elon Musk).
Y'all remember your algorithms classes from college, right? Why not apply that here? Because China is doing just that, and frankly making us look stupid by comparison.
If you're in SF, you don't want to miss this. The Qwen team is making their first public appearance in the United States, with the VP of Qwen Lab speaking at the meetup below during SF Tech Week. https://partiful.com/e/P7E418jd6Ti6hA40H6Qm Rare opportunity to directly engage with the Qwen team members.
Let’s hope they’re allowed in the country and get a visa… it’s 50/50 these days
Registration full :-(