Qwen3-VL can scan two-hour videos and pinpoint nearly every detail

2025-11-30 | the-decoder.com


A few months after launching Qwen3-VL, Alibaba has released a detailed technical report on the open multimodal model. The data shows the system excels at image-based math tasks and can analyze hours of video footage.

The system handles massive data loads, processing two-hour videos or hundreds of document pages within a 256K-token context window.

In "needle-in-a-haystack" tests, the flagship 235-billion-parameter model located individual frames in 30-minute videos with 100 percent accuracy. Even in two-hour videos containing roughly one million tokens, accuracy held at 99.5 percent. The test works by inserting a semantically important "needle" frame at random positions in long videos, which the system must then find and analyze.

Heatmap with video lengths on the y-axis and frame positions on the x-axis; most cells show high accuracy percentages, with perfect scores for shorter videos.
The needle-in-a-haystack test measures the model's ability to locate specific frames in long videos. | Image: Alibaba
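As a concrete illustration, a minimal harness for this kind of test might look like the following sketch. The helper names, the frame-level tolerance in scoring, and the 1 FPS frame count are my assumptions, not Alibaba's evaluation code.

```python
import random

def build_needle_haystack(haystack_frames, needle_frame, seed=None):
    """Insert a semantically salient 'needle' frame at a random position
    in a list of video frames; return the probe video plus the
    ground-truth index (hypothetical test harness)."""
    rng = random.Random(seed)
    pos = rng.randint(0, len(haystack_frames))
    probe = haystack_frames[:pos] + [needle_frame] + haystack_frames[pos:]
    return probe, pos

def score(predicted_idx, true_idx, tolerance=1):
    """Count a prediction as correct if it lands within `tolerance`
    frames of the inserted needle (evaluation detail assumed)."""
    return abs(predicted_idx - true_idx) <= tolerance

# A 30-minute video sampled at 1 FPS is roughly 1800 frames.
haystack = [f"frame_{i}" for i in range(1800)]
probe, true_idx = build_needle_haystack(haystack, "NEEDLE", seed=0)
assert probe[true_idx] == "NEEDLE"
assert len(probe) == 1801
```

The model under test would then be asked to return the timestamp of the needle frame, and `score` compares it against the known insertion point.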

In published benchmarks, the Qwen3-VL-235B-A22B model often beats Gemini 2.5 Pro, OpenAI GPT-5, and Claude Opus 4.1 - even when competitors use reasoning features or high thinking budgets. The model dominates visual math tasks, scoring 85.8 percent on MathVista compared to GPT-5's 81.3 percent. On MathVision, it leads with 74.6 percent, ahead of Gemini 2.5 Pro (73.3 percent) and GPT-5 (65.8 percent).


Table of benchmark results for Qwen3-VL-235B, Gemini 2.5 Pro, OpenAI GPT-5, and Claude Opus 4.1.
Gemini's older 2.5 Pro model maintains a slight lead in general image understanding. | Image: Alibaba

The model also shows range in specialized benchmarks. It scored 96.5 percent on the DocVQA document comprehension test and 875 points on OCRBench, supporting 39 languages - nearly four times as many as its predecessor.

Bar chart of Qwen3-VL's OCR accuracy across 39 languages, with most bars above the 70 percent mark.
Qwen3-VL achieves over 70 percent accuracy on OCR tasks in 32 of the 39 supported languages. | Image: Alibaba

Alibaba claims the system demonstrates new capabilities in GUI agent tasks. It achieved 61.8 percent accuracy on ScreenSpot Pro, which tests navigation in graphical user interfaces. On AndroidWorld, where the system must independently operate Android apps, Qwen3-VL-32B hit 63.7 percent.

The model handles complex, multi-page PDF documents as well. It scored 56.2 percent on MMLongBench-Doc for long document analysis. On the CharXiv benchmark for scientific charts, it reached 90.5 percent on description tasks and 66.2 percent on complex reasoning questions.

It is not a clean sweep, however. In the complex MMMU-Pro test, Qwen3-VL scored 69.3 percent, trailing GPT-5's 78.4 percent. Commercial competitors also generally lead in video QA benchmarks. The data suggests Qwen3-VL is a specialist in visual math and documents, but still lags in general reasoning.

Key technical advances for multimodal AI

The technical report outlines three main architectural upgrades. First, "interleaved MRoPE" replaces the previous position-embedding method. Instead of giving each axis (time, horizontal, vertical) its own contiguous block of embedding frequencies, the new approach interleaves the three axes evenly across the full frequency spectrum. This change aims to boost performance on long videos.
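The difference between the two layouts can be illustrated with a toy allocation of rotary-frequency slots to the three axes. This sketch shows only the assignment pattern, not the actual embedding math; the function names are hypothetical.

```python
from itertools import cycle, islice

def chunked_axes(n_freqs):
    """Previous MRoPE layout: each axis (time 't', height 'h', width 'w')
    owns a contiguous block of frequency dimensions, so time is
    confined to one end of the spectrum."""
    third = n_freqs // 3
    return ["t"] * third + ["h"] * third + ["w"] * (n_freqs - 2 * third)

def interleaved_axes(n_freqs):
    """Interleaved MRoPE: the three axes alternate across the whole
    spectrum, so each axis covers low and high frequencies alike."""
    return list(islice(cycle(["t", "h", "w"]), n_freqs))

print(chunked_axes(12))      # time occupies only the first third
print(interleaved_axes(12))  # t, h, w repeat across all slots
```

In the chunked layout the temporal axis never touches part of the frequency range; interleaving spreads it across the spectrum, which is the property the report credits for better long-video behavior.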


Schematic of the Qwen3-VL architecture with the vision encoder on the left and the large language model on the right, connected by data flows and DeepStack links.
Qwen3-VL combines a vision encoder and language model to process text, images, and videos simultaneously. DeepStack uses visual information from different processing levels. | Image: Alibaba

Second, DeepStack technology allows the model to access intermediate results from the vision encoder, not just the final output. This gives the system access to visual information at different levels of detail.
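A rough sketch of that routing idea, assuming evenly spaced injection points over the first half of the LLM stack; the spacing rule and all names here are my assumptions, as the paragraph above does not specify the exact scheme.

```python
def deepstack_injection_map(n_llm_layers, n_vision_levels):
    """DeepStack (sketch): instead of feeding only the vision encoder's
    final output into the LLM input, route features from several
    intermediate encoder levels into distinct early LLM layers.
    Even spacing over the first half of the stack is assumed."""
    step = max(1, (n_llm_layers // 2) // n_vision_levels)
    return {f"vision_level_{i}": f"llm_layer_{i * step}"
            for i in range(n_vision_levels)}

# A 32-layer LLM tapping 3 vision-encoder levels:
mapping = deepstack_injection_map(32, 3)
print(mapping)  # level 0 -> layer 0, level 1 -> layer 5, level 2 -> layer 10
```

The point of the pattern is that early LLM layers see coarse visual features while later injection points receive more abstract ones, rather than the model having access to only one level of detail.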

Third, a text-based timestamp system replaces the complex T-RoPE method found in Qwen2.5-VL. Instead of assigning a mathematical time position to every video frame, the system now inserts simple text markers like "<3.8 seconds>" directly into the input. This simplifies the process and improves the model's grasp of time-based video tasks.
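A minimal sketch of that interleaving, assuming frames sampled at a fixed FPS and a simplified marker format (the real tokenization details are not spelled out in the article):

```python
def interleave_timestamps(frame_tokens, fps=1.0):
    """Replace positional time encoding with plain text markers:
    insert a '<t seconds>' string before each frame's visual tokens.
    Marker format simplified for illustration."""
    out = []
    for i, tokens in enumerate(frame_tokens):
        out.append(f"<{i / fps:.1f} seconds>")
        out.extend(tokens)
    return out

# Two frames sampled at 1 FPS, each contributing two visual tokens:
seq = interleave_timestamps([["f0a", "f0b"], ["f1a", "f1b"]])
print(seq)  # ['<0.0 seconds>', 'f0a', 'f0b', '<1.0 seconds>', 'f1a', 'f1b']
```

Because the timestamps are ordinary text tokens, the language model can reference them directly when answering "when does X happen" questions, with no special positional machinery per frame.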

Training at scale with one trillion tokens

Alibaba trained the model in four phases on up to 10,000 GPUs. After learning to link images and text, the system underwent full multimodal training on about one trillion tokens. Data sources included web scrapes, 3 million PDFs from Common Crawl, and over 60 million STEM tasks.

In later phases, the team gradually expanded the context window from 8,000 to 32,000 and finally to 262,144 tokens (the 256K window). The "Thinking" variants received dedicated chain-of-thought training, allowing them to explicitly map out reasoning steps for better results on complex problems.


Open weights under Apache 2.0

All Qwen3-VL models released since September are available under the Apache 2.0 license with open weights on Hugging Face. The lineup includes dense variants ranging from 2B to 32B parameters, as well as mixture-of-experts models: the 30B-A3B and the massive 235B-A22B.

While features like extracting frames from long videos aren't new - Google's Gemini 1.5 Pro handled this in early 2024 - Qwen3-VL offers competitive performance in an open package. With the previous Qwen2.5-VL already common in research, the new model is likely to drive further open-source development.


Comments

  • By coppsilgold 2025-12-03 2:22 (2 replies)

    > The test works by inserting a semantically important "needle" frame at random positions in long videos, which the system must then find and analyze.

    This seems to be somewhat unwise. Such an insertion would qualify as an anomaly. And if it's also trained that way, would you not train the model to find artificial frames where they don't belong?

    Would it not have been better to find a set of videos where something specific (common, rare, surprising, etc) happens at some time and ask the model about that?

    • By IanCal 2025-12-03 14:44

      That rather depends on exactly how this is done, although it's a useful upper bound for many tasks either way. You could say the same for images and yet due to the way some work they straight up cannot see in certain ways.

      This could describe adding a frame of nonsense into an existing video.

      It also could describe finding a semantically useful thing in an actual video, where the exact location is randomised by looking at different time crops of the video. For example, finding a book on a desk in a video that's only there in a panning shot, and you then see if it can find it in a 10s cut, 20s cut, 10 minute cut, etc, and near the start/middle/end.

      Here's the paper: https://arxiv.org/pdf/2511.21631

      > To evaluate the model’s capability in processing long-context inputs, we construct a video “Needle-in-a-Haystack” evaluation on Qwen3-VL-235B-A22B-Instruct. In this task, a semantically salient “needle” frame—containing critical visual evidence—is inserted at varying temporal positions within a long video. The model is then tasked with accurately locating the target frame from the long video and answering the corresponding question. During evaluation, videos are uniformly sampled at 1 FPS, and frame resolution is dynamically adjusted to maintain a constant visual token budget.

      This potentially sounds more like the former, but I can't find more accurate information on how this works.

      Regardless I'd say again that while not the whole story things like this really are useful to know, and can be very important to test - it's really not a given that models can always find anything in their context window, perhaps even more so for video.

    • By bigmadshoe 2025-12-03 14:01 (2 replies)

      Yeah the needle in a haystack tests are so stupid. It seems clear with LLMs that performance degrades massively with context size, yet those tests claim the model performs perfectly.

      • By patates 2025-12-03 14:32

        As someone who abuses gemini regularly with a 90% full context, the model performance does degrade for sure but I wouldn't call it massively.

        I can't show any evidence as I don't have such tests, but it's like coding normally vs coding after a beer or two.

        For the massive effect, fill it 95% and we're talking vodka shots. 99%? A zombie who can code. But perhaps that's not fair when you have 1M token context size.

  • By mikae1 2025-12-03 1:07 (4 replies)

    Hope this one day will be used for auto-tagging all video assets with time codes. The dream of being able to search for "running horse" and find a clip containing a running horse at 4m42s in one of thousands of clips.

    • By tontonius 2025-12-03 13:17

      this is a solved problem already — check out https://getjumper.io where you can do exactly this (search through 100s of hours) offline and locally.

      Disclaimer: co-founder

    • By laidoffamazon 2025-12-03 3:09 (2 replies)

      It’s not difficult to hack this together with CLIP. I did this with about a tenth of my movie collection last week with a GTX 1080 - though it lacks temporal understanding so you have to do the scene analysis yourself

      • By vhcr 2025-12-03 6:56 (1 reply)

        I'm guessing you're storing a CLIP embedding not for every single frame, but for every second or so? Also, are you using cosine similarity? How are you finding the nearest vector?

        • By laidoffamazon 2025-12-03 15:25

          I split per scene using pyscenedetect and sampled from each. Distance is via cosine similarity - I fed it into Qdrant

      • By dynode 2025-12-03 5:09 (1 reply)

        Would you be willing to share more details of what you did?

        • By laidoffamazon 2025-12-03 20:34

          Sure. I had a lot of help from Claude Opus 4.5, but it was roughly:

          - Using pyscenedetect to split each video on a per scene level

          - Using the decord library https://github.com/dmlc/decord to pull frames from each scene at a particular sample rate (specific rate I don't have handy right now, but it was 1-2 per scene)

          - Aggregating frames in batches of around 256 frames to be normalized for CLIP embedding on GPU (had to re-write the normalization process for this because the default library does it on CPU)

          - Uploading the frames along with metadata (timestamp, etc) into a vector DB, in my case Qdrant running locally along with a screenclip of the frame itself for debugging.

          I'm bottlenecked by GPU compute so I also started experimenting with using Modal for the embedding work too, but then vacation ended :) Might pick it up again in a few weeks. I'd like to be able to have a temporal-aware and potentially enriched search so that I can say "Seek to the scene in Oppenheimer where Rami Malek testifies" and be able to get a timestamped clip from the movie.
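          The per-scene sampling step described above can be sketched as follows; the even-spacing policy is an assumption, and in practice the scene bounds would come from a detector like pyscenedetect.

```python
def sample_frames_per_scene(scene_bounds, per_scene=2):
    """Given (start_frame, end_frame) pairs from a scene detector,
    pick evenly spaced frame indices within each scene to embed
    (e.g. with CLIP). Sampling 1-2 frames per scene matches the
    pipeline above; the even spacing is assumed."""
    picks = []
    for start, end in scene_bounds:
        length = end - start
        for k in range(per_scene):
            picks.append(start + (k + 1) * length // (per_scene + 1))
    return picks

# Two scenes: frames 0-300 and 300-450, two samples each:
print(sample_frames_per_scene([(0, 300), (300, 450)]))  # [100, 200, 350, 400]
```

Each picked index would then be decoded (e.g. via decord), embedded, and upserted into the vector DB with its timestamp as payload.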

    • By ArnavAgrawal03 2025-12-03 4:27 (2 replies)

      you can do that with Morphik already :)

      We use an embedding model that processes videos and allows you to perform RAG on them.

      • By eurekin 2025-12-03 11:13

        Would it allow me to query my library for every movie that contains dance routing move1-move2-move3 in that order?

      • By arresin 2025-12-03 9:00

        Rag as in the content is used to generate an answer or rag as in searching for a video?

    • By xnx 2025-12-04 1:06

      Gemini already does this (and has for awhile): https://ai.google.dev/gemini-api/docs/video-understanding

  • By clusterhacks 2025-12-03 1:45 (1 reply)

    I was playing around with Qwen3-VL to parse PDFs - meaning, do some OCR data extraction from a reasonably well-formatted PDF report. Failed miserably, although I was using the 30B-A3B model instead of the larger one.

    I like the Qwen models and use them for other tasks successfully. It is so interesting how LLMs will do quite well in one situation and quite badly in another.

HackerNews