As I mentioned yesterday, I recently needed to process hundreds of low-quality images of invoices (for a construction project). I had a script that used PIL/OpenCV, pytesseract, and OpenAI as a fallback. It still had a staggering number of failures.
Today I tried a handful of the really poor-quality invoices and Qwen spat out all the information I needed without an issue. What's crazier is that it gave me bounding boxes I could use to improve tesseract.
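For reference, the old script was roughly along these lines (heavily simplified; the model name, fallback prompt, and confidence cutoff are placeholders rather than my exact values):

    import base64
    import pytesseract
    from PIL import Image, ImageOps
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def vlm_fallback(path):
        # Last resort: send the raw image to a hosted vision model
        b64 = base64.b64encode(open(path, "rb").read()).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": [
                {"type": "text", "text": "Extract the invoice number, date, and total from this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ]}],
        )
        return resp.choices[0].message.content

    def ocr_invoice(path):
        # Grayscale + autocontrast, then tesseract; fall back when confidence is poor
        img = ImageOps.autocontrast(Image.open(path).convert("L"))
        data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
        words = [w for w, c in zip(data["text"], data["conf"])
                 if w.strip() and float(c) > 60]
        return " ".join(words) if len(words) > 10 else vlm_fallback(path)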
I would recommend taking a look at this service: https://learn.microsoft.com/en-us/rest/api/computervision/re...
Microsoft Vision is slow, has a ridiculous rate limit, and isn't any better than what you can run yourself. Every request has to go over HTTP (subject to that rate limit), and there is no way to submit bulk jobs. It's also incredibly expensive.
I wonder why you chose Qwen specifically - Mistral has a specialized model just for OCR that they advertised heavily (I tested it and it works surprisingly well, at least on English-language books from the '80s and '90s).
Mistral's model was terrible when I tested it on non-Latin characters and on anything that isn't neatly printed text (e.g. handwriting).
I like to test these models on reading the contents of '80s Apple ][ game screenshots. These are very low resolution and very dense. All (free-to-use) models struggle at that task...
My dataset could be described in a similar way. Very low quality, very odd layouts, information density where it's completely unnecessary.
And these contractors were relatively good operators compared to most.
Interesting. I have in the past tried to get VLMs to estimate bounding boxes for property boundaries on satellite maps, but had no success. Do you have any tips on how to improve the results?
With Qwen I went as stupid as I could: "please provide the bounding box metadata for pytesseract for the above image."
And it spat it out.
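If it helps anyone, here's roughly how I fed the boxes back into tesseract (a simplified sketch; the filename and the JSON below are made-up examples, and the exact keys you get back depend on how you phrase the prompt):

    import json
    import pytesseract
    from PIL import Image

    img = Image.open("invoice_047.jpg")  # placeholder filename

    # Illustrative example of the kind of JSON the model can return
    # (key names and coordinates here are made up)
    boxes = json.loads("""[
      {"label": "invoice_number", "bbox_2d": [812, 54, 1010, 92]},
      {"label": "total", "bbox_2d": [790, 1403, 1012, 1447]}
    ]""")

    for b in boxes:
        x1, y1, x2, y2 = b["bbox_2d"]
        # If the coordinates look too small for your image, the model was likely working
        # on a resized copy -- rescale by (original size / model input size) first.
        crop = img.crop((x1, y1, x2, y2))
        text = pytesseract.image_to_string(crop, config="--psm 7")  # treat crop as one text line
        print(b["label"], "->", text.strip())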
It’s funny that many of us say please. I don’t think it impacts the output, but it also feels wrong without it sometimes.
Depends on the model, but e.g. [1] found that many models perform better if you are more polite. Though, interestingly, being rude can also sometimes improve performance at the cost of higher bias.
Intuitively it makes sense. The best sources tend to be either moderately polite (professional language) or 4chan-like (rude and biased, but honest).
When I want an LLM to be brief, I will say things like "be brief", "don't ramble", etc.
When that fails, "shut the fuck up" always seems to do the trick.
I ripped into cursor today. It didn't change anything but I felt better lmao
Before GPT-5 was released, I already had the feeling that the web UI responses were declining, so I started trying to get more out of them. Dissing it and saying how useless its response was did actually seem to improve the output (I think).
The way I think of it, talking to an LLM is a bit like talking to myself or listening to an echo, since what I get back depends only on what I put in. If it senses that I'm frustrated, it will be inclined to make even more stuff up in an attempt to appease me, so that gets me nowhere.
I've found it more useful to keep it polite and "professional" and restart the conversation if we've begun going around in circles.
And besides, if I make a habit of behaving badly with LLMs, there's a good chance that I'll do it without thinking at some point and get in trouble.
It's a good habit to build now in case AGI actually happens out of the blue.
Gemini has purpose-built post-training for bounding boxes, if you haven't tried it.
The latest update to Gemini Live does real-time bounding boxes on the objects it's talking about; it's pretty neat.
shameless plug here for AMD's AI Dev Day - registration is open and they want feedback on what to focus on: https://www.amd.com/en/corporate/events/amd-ai-dev-day.html
Also, a documented stack setup, if you could.
I've tried that too, trying to detect the scan layout to get better OCR, but it didn't really beat a fine-tuned Qwen2.5-VL 7B. I'd say fine-tuning is the way to go.
What's the cost of the fine-tuned model? If you were attempting to optimize for cost, would it be worth it to detect scan layouts to get better OCR?
Honestly, I'm such a noob in this space. I had 1 project I needed to do, didn't want to do it by hand which would have taken 2 days so I spent 5 trying to get a script to do it for me.
The model runs on an H200 in ~20s, costing about $2.40/hr. On an L4 it's cheaper at ~$0.30/hr but takes ~85s to finish. Overall, the H200 ends up cheaper at volume. My scans have a separate issue though: each page has two columns, so text from the right column sometimes bleeds into the left. OCR can't really tell where sentences start and end unless the layout is split by column.
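To make the column problem concrete, the kind of split I mean looks roughly like this (a simplified sketch; the path is a placeholder and it assumes the gutter sits somewhere near the middle of the page):

    import cv2
    import numpy as np
    import pytesseract

    page = cv2.imread("page_012.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
    _, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Find the column gutter: the vertical band with the least ink near the page centre
    ink_per_column = binary.sum(axis=0)
    mid = page.shape[1] // 2
    window = ink_per_column[mid // 2 : mid + mid // 2]
    gutter = mid // 2 + int(np.argmin(window))

    # OCR each column separately so sentences don't bleed across the gutter
    left, right = page[:, :gutter], page[:, gutter:]
    text = pytesseract.image_to_string(left) + "\n" + pytesseract.image_to_string(right)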
Just unsloth on Colab using an A100, with the dataset on Google Drive.
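Roughly this, from memory (the checkpoint name and hyperparameters below are illustrative, not my exact settings; check unsloth's own vision fine-tuning notebooks for the current API):

    from unsloth import FastVisionModel

    model, tokenizer = FastVisionModel.from_pretrained(
        "unsloth/Qwen2.5-VL-7B-Instruct",  # assumed checkpoint name
        load_in_4bit=True,                 # fits comfortably on the A100
    )
    model = FastVisionModel.get_peft_model(
        model,
        finetune_vision_layers=True,       # the scans are the hard part, so tune vision too
        finetune_language_layers=True,
        r=16,
        lora_alpha=16,
    )
    FastVisionModel.for_training(model)
    # From here it's trl's SFTTrainer with unsloth's vision data collator, pointed at
    # the image/label pairs mounted from Google Drive.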
So where did you load up Qwen, and how did you supply the PDF or photo files? I don't know how to use these models, but I want to learn.
LM Studio[0] is the best "i'm new here and what is this!?" tool for dipping your toes in the water.
If the model supports "vision" or "sound", that tool makes it relatively painless to take your input file + text and feed it to the model.
[0]: https://lmstudio.ai/
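Once a model is loaded, LM Studio can also expose a local OpenAI-compatible server (default port 1234), so you can script the same thing. Roughly (model name and file path are placeholders):

    import base64
    from openai import OpenAI

    # LM Studio's local server speaks the OpenAI chat API
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    b64 = base64.b64encode(open("invoice_scan.jpg", "rb").read()).decode()
    resp = client.chat.completions.create(
        model="qwen2.5-vl-7b-instruct",  # whatever name LM Studio shows for the loaded model
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Read this invoice and return the number, date, and total."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
    )
    print(resp.choices[0].message.content)

For PDFs, I'd convert the pages to images first (e.g. with pdf2image) and send them one at a time.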
Jumping on this for visibility: LM Studio really is the best option out there. Ollama is another runtime I've used, but I've found it makes too many assumptions about what a computer is capable of, and it's almost impossible to override those settings. It often overloads weaker computers and underestimates stronger ones.
LM Studio isn't as "set it and forget it" as Ollama is, and it does have a bit of a learning curve. But if you're doing any kind of AI development and you don't want to mess around with writing llama-cpp scripts all the time, it really can't be beat (for now).
Thank you! I will give it a try and see if I can get that 4090 working a bit.
You can use their models at chat.qwenlm.ai; it's their official website.
I wouldn't recommend using anything that can transmit data back to the CCP. The model itself is fine since it's open source (and you can run it firewalled if you're really paranoid), but directly using Alibaba's AI chat website should be discouraged.
AnythingLLM is also good for that GUI experience!
I should add that sometimes LM Studio just feels better for the use case: same model, same purpose, seemingly different output, usually when RAG is involved. But AnythingLLM is definitely a very intuitive visual experience.
Any tips on getting bounding boxes? The model doesn't seem to even understand the original size of the image. And even if I provide the dimensions, the positioning is off. :'(
Wait a moment... It gave you BOUNDING BOXES? That is awesome! That is a missing link I need for models.
I had success with tabula; you may not need AI. But fine if it works too.
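Minimal usage, if you haven't tried it (tabula-py needs a Java runtime, and it only works on PDFs with a real text layer, not scans; the path is a placeholder):

    # pip install tabula-py
    import tabula

    tables = tabula.read_pdf("invoice.pdf", pages="all")  # returns a list of pandas DataFrames
    print(tables[0])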
People actually use tesseract? It's one of the worst OCR solutions out there. Forget it.
I would strongly emphasize:
CV != AI Vision
gpt-4o would breeze through your poor images.
It did not, unfortunately. When CV failed, gpt-4o failed as well. I even had a list of valid invoice numbers & dates to help the models. Still, most failed.
Construction invoices are not great.
Did you try few-shotting examples when you hit problem cases? In my ziploc case, the model was failing if red sharpie was used vs black. A few-shot hint fixed that.
Tbh, I had run the images through a few filters. The images that went through to the AI were high-contrast black and white, with noise such as highlighter removed. I had tried one-shot and few-shot.
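The filter pass was along these lines (simplified; the thresholds and path here are made up):

    import cv2

    img = cv2.imread("invoice_raw.jpg")  # placeholder path
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    # Push strongly saturated pixels (highlighter, coloured stamps) to white
    saturated = cv2.inRange(hsv, (0, 80, 80), (179, 255, 255))
    img[saturated > 0] = (255, 255, 255)

    # Then grayscale + Otsu for a high-contrast black-and-white image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    cv2.imwrite("invoice_clean.png", bw)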
I think it was largely a formatting issue. Some of these invoices have nonsense layouts. Perhaps Qwen works well because it doesn't assume left-to-right, top-to-bottom? Just speculating, though.
I'm very surprised. I've dealt with some really ugly inputs (handwritten text on full Ziploc bags, stained and torn handwritten recipe cards, etc.) with super good success.
The Chinese are doing what they have been doing to the manufacturing industry as well: take the core technology and just optimize, optimize, optimize for 10x the cost efficiency. As simple as that. Super impressive. These models might be benchmaxxed, but as another comment said, I see so many that it might as well be the most impressive benchmaxxing today, if not just a genuinely SOTA open-source model. They even released a closed-source 1-trillion-parameter model today as well that is sitting at no. 3 (!) on LM Arena. Even their 80B model is 17th; gpt-oss 120b is 52nd. https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2...
They still suck at explaining which model they serve is which, though.
They also released Qwen3-VL Plus [1] today, alongside Qwen3-VL 235B [2], and they don't tell us which one is better. Note that Qwen3-VL-Plus is a very different model from Qwen-VL-Plus.
Also, qwen-plus-2025-09-11 [3] vs qwen3-235b-a22b-instruct-2507 [4]. What's the difference? Which one is better? Who knows.
You know it's bad when OpenAI has a clearer naming scheme.
[1] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[2] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[3] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[4] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
> They still suck at explaining which model they serve is which, though.
"they" in this sentence probably applies to all "AI" companies.
Even the naming/versioning of OpenAI models is ridiculous, and then you can never find out which is actually better for your needs. Every AI company writes several paragraphs of fluffy text with lots of hand waving, saying how this model is better for complex tasks while this other one is better for difficult tasks.
Both DeepSeek and Claude are exceptions: simple version numbers, and Sonnet is overall worse but faster than Opus for the same version.
Eh, I mean, often innovation comes just from letting a lot of fragmented, small teams of cracked nerds try stuff out. It's way too early in the game. I mean, Qwen's release statements have anime in them, etc. IBM, Bell, Google, Dell: many did it similarly, letting small, focused teams take many attempts at cracking the same problem. All modern quant firms do basically the same. Anthropic is actually an exception, more like Apple.
It's sometimes not really a matter of which one is better but which one fits best.
For example, many have switched to Qwen3 models, but some still vastly prefer the reasoning and output of QwQ (a Qwen2.5-based model).
And the difference between them: those with "Plus" are closed-weight; you can only access them through their API. The others are open-weight, so if they fit your use case, and if the want or need ever arises, you can download them, use them, and even fine-tune them locally, even if Qwen stops offering access to them.
If the naming is so clear to you, then why don't you explain: for a user who wants to use Qwen3-VL through an API, which one has better performance? Qwen3-VL Plus or Qwen3-VL 235b?
My previous post should have answered this question. But since it didn't, I think I'm ill-equipped to answer you in a satisfactory fashion; I would just be repeating myself.
> Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency. As simple as that. Super impressive.
This "just" is incorrect.
The Qwen team invented things like DeepStack https://arxiv.org/abs/2406.04334
(Also, I hate this "The Chinese" thing. Do we say "The British" if it came from a DeepMind team in the UK? Or what if there are Chinese-born US citizens working in Paris for Mistral?
Give credit to the Qwen team rather than a whole country. China has both great labs and mediocre labs, just like the rest of the world.)
The naming makes some sense here. It's backed by the very Chinese Alibaba and the government directly as well. It's almost a national project.
The Americans do that all the time. :P
> Do we say "The British"
Yes.
Yeah it's just weird Orientalism all over again
> Also I hate this "The Chinese" thing
to me it was positive assessment, I adore their craftsmanship and persistence in moving forward for long period of time.
It erases the individuals doing the actual research by viewing Chinese people as a monolith.
Interestingly, I've found that models like Kimi K2 spit out more organic, natural-sounding text than American models.
It fails on the benchmarks compared to other SOTA models, but the real-world experience is different.
> Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency.
This is what really grinds my gears about American AI and American technology in general lately, as an American myself. We used to do that! But over the last 10-15 years, it seems like all this country can do is try to throw more and more resources at something instead of optimizing what we already have.
Download more ram for this progressive web app.
Buy a Threadripper CPU to run this game that looks worse than the ones you played on the Nintendo Gamecube in the early 2000s.
Generate more electricity (hello Elon Musk).
Y'all remember your algorithms classes from college, right? Why not apply that here? Because China is doing just that, and frankly making us look stupid by comparison.
If you're in SF, you don't want to miss this. The Qwen team is making their first public appearance in the United States, with the VP of Qwen Lab speaking at the meetup below during SF Tech Week. https://partiful.com/e/P7E418jd6Ti6hA40H6Qm Rare opportunity to directly engage with the Qwen team members.
Let’s hope they’re allowed in the country and get a visa… it’s 50/50 these days
Registration full :-(