Mistral 3 family of models released

2025-12-02 15:01 · mistral.ai

A family of frontier open-source multimodal models

Today, we announce Mistral 3, the next generation of Mistral models. Mistral 3 includes three state-of-the-art small, dense models (14B, 8B, and 3B) and Mistral Large 3 – our most capable model to date – a sparse mixture-of-experts trained with 41B active and 675B total parameters. All models are released under the Apache 2.0 license. Open-sourcing our models in a variety of compressed formats empowers the developer community and puts AI in people’s hands through distributed intelligence.

The Ministral models represent the best performance-to-cost ratio in their category. At the same time, Mistral Large 3 joins the ranks of frontier instruction-fine-tuned open-source models.

Mistral Large 3: A state-of-the-art open model

[Chart: base model performance comparison]

[Chart: Mistral 3 model performance comparison (instruct)]

Mistral Large 3 is one of the best permissive open-weight models in the world, trained from scratch on 3,000 NVIDIA H200 GPUs. Mistral Large 3 is Mistral's first mixture-of-experts model since the seminal Mixtral series, and represents a substantial step forward in pretraining at Mistral. After post-training, the model achieves parity with the best instruction-tuned open-weight models on the market on general prompts, while also demonstrating image understanding and best-in-class performance on multilingual conversations (i.e., non-English/Chinese).

Mistral Large 3 debuts at #2 in the OSS non-reasoning models category (#6 amongst OSS models overall) on the LMArena leaderboard.

[Chart: LMArena leaderboard placement for Mistral Large 3]

We release both the base and instruction fine-tuned versions of Mistral Large 3 under the Apache 2.0 license, providing a strong foundation for further customization across the enterprise and developer communities. A reasoning version is coming soon! 

Mistral, NVIDIA, vLLM & Red Hat join forces to deliver faster, more accessible Mistral 3

Working in conjunction with vLLM and Red Hat, we have made Mistral Large 3 readily accessible to the open-source community. We're releasing a checkpoint in NVFP4 format, built with llm-compressor. This optimized checkpoint lets you run Mistral Large 3 efficiently on Blackwell NVL72 systems and on a single 8×A100 or 8×H100 node using vLLM.
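
As a rough illustration, here is a minimal vLLM sketch for offline inference with the NVFP4 checkpoint (the model ID and sampling settings are assumptions, not an official deployment recipe):

```
# Minimal sketch: run the NVFP4 checkpoint with vLLM on one 8-GPU node.
# The model ID below is a placeholder; check Hugging Face for the exact name.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Large-3-Instruct-NVFP4",  # hypothetical checkpoint ID
    tensor_parallel_size=8,  # shard across the node's 8 A100/H100 GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain mixture-of-experts in one paragraph."], params)
print(outputs[0].outputs[0].text)
```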

Delivering advanced open-source AI models requires broad optimization, achieved through a partnership with NVIDIA. All our new Mistral 3 models, from Large 3 to Ministral 3, were trained on NVIDIA Hopper GPUs to tap high-bandwidth HBM3e memory for frontier-scale workloads. NVIDIA's extreme co-design approach brings hardware, software, and models together: NVIDIA engineers enabled inference support in TensorRT-LLM and SGLang for the complete Mistral 3 family, including efficient low-precision execution.

For Large 3's sparse MoE architecture, NVIDIA integrated state-of-the-art Blackwell attention and MoE kernels, added support for prefill/decode disaggregated serving, and collaborated with Mistral on speculative decoding, enabling developers to efficiently serve long-context, high-throughput workloads on GB200 NVL72 and beyond. On the edge, NVIDIA delivers optimized deployments of the Ministral models on DGX Spark, RTX PCs and laptops, and Jetson devices, giving developers a consistent, high-performance path to run these open models from data center to robot.

We are grateful for this collaboration and want to thank vLLM, Red Hat, and NVIDIA in particular.

Ministral 3: State-of-the-art intelligence at the edge

[Chart: GPQA Diamond accuracy]

For edge and local use cases, we release the Ministral 3 series, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release base, instruct, and reasoning variants to the community, each with image understanding capabilities, all under the Apache 2.0 license. Combined with the models' native multimodal and multilingual capabilities, the Ministral 3 family offers a model for every enterprise and developer need.

Furthermore, Ministral 3 achieves the best cost-to-performance ratio of any OSS model. In real-world use cases, both the number of generated tokens and model size matter equally. The Ministral instruct models match or exceed the performance of comparable models while often producing an order of magnitude fewer tokens. 

For settings where accuracy is the only concern, the Ministral reasoning variants can think longer to produce state-of-the-art accuracy in their weight class: for instance, 85% on AIME '25 with our 14B variant.

Available Today

Mistral 3 is available today on Mistral AI Studio, Amazon Bedrock, Azure Foundry, Hugging Face (Large 3 & Ministral), Modal, IBM WatsonX, OpenRouter, Fireworks, Unsloth AI, and Together AI. Support on NVIDIA NIM and AWS SageMaker is coming soon.

One more thing… customization with Mistral AI

For organizations seeking tailored AI solutions, Mistral AI offers custom model training services to fine-tune or fully adapt our models to your specific needs. Whether optimizing for domain-specific tasks, enhancing performance on proprietary datasets, or deploying models in unique environments, our team collaborates with you to build AI systems that align with your goals. For enterprise-grade deployments, custom training ensures your AI solution delivers maximum impact securely, efficiently, and at scale.

Get started with Mistral 3

The future of AI is open. Mistral 3 redefines what’s possible with a family of models built for frontier intelligence, multimodal flexibility, and unmatched customization. Whether you’re deploying edge-optimized solutions with Ministral 3 or pushing the boundaries of reasoning with Mistral Large 3, this release puts state-of-the-art AI directly into your hands.

Why Mistral 3?

  • Frontier performance, open access: Achieve closed-source-level results with the transparency and control of open-source models.

  • Multimodal and multilingual: Build applications that understand text, images, and complex logic across 40+ native languages.

  • Scalable efficiency: From a 3B dense model to a 675B-parameter (41B active) mixture-of-experts, choose the model that fits your needs, from edge devices to enterprise workflows.

  • Agentic and adaptable: Deploy for coding, creative collaboration, document analysis, or tool-use workflows with precision.

Next Steps

Science has always thrived on openness and shared discovery. As pioneering French scientist and two-time Nobel laureate Marie Skłodowska-Curie once said, “Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less.” 

This philosophy drives our mission at Mistral AI. We believe that the future of AI should be built on transparency, accessibility, and collective progress. With this release, we invite the world to explore, build, and innovate with us, unlocking new possibilities in reasoning, efficiency, and real-world applications.

Together, let’s turn understanding into action.


Comments

  • By barrell 2025-12-02 16:04

    I use large language models in http://phrasing.app to format data I can retrieve in a consistent skimmable manner. I switched to mistral-3-medium-0525 a few months back after struggling to get gpt-5 to stop producing gibberish. It's been insanely fast, cheap, reliable, and follows formatting instructions to the letter. I was (and still am) super super impressed. Even if it does not hold up in benchmarks, it still outperformed in practice.

    I'm not sure how these new models compare to the biggest and baddest models, but if price, speed, and reliability are a concern for your use cases I cannot recommend Mistral enough.

    Very excited to try out these new models! To be fair, mistral-3-medium-0525 still occasionally produces gibberish in ~0.1% of my use cases (vs gpt-5's 15% failure rate). Will report back if that goes up or down with these new models.

    • By mrtksn 2025-12-02 16:39

      Some time ago I canceled all my paid subscriptions to chatbots because they are interchangeable so I just rotate between Grok, ChatGPT, Gemini, Deepseek and Mistral.

      On the API side of things my experience is that the model behaving as expected is the greatest feature.

      There I also switched to OpenRouter instead of paying directly so I can use whatever model fits best.

      The recent buzz about ad-based chatbot services is probably because the companies no longer have an edge despite what the benchmarks say; users are noticing it and canceling paid plans. Just today OpenAI offered me a 1-month free trial as if I wasn’t using it two months ago. I guess they hope I forget to cancel.

      • By barrell 2025-12-02 16:51

        Yep, I spent 3 days optimizing my prompt trying to get gpt-5 to work. Tried a bunch of different models (some Azure, some OpenRouter) and got a better success rate with several others without any tailoring of the prompt.

        It was really plug and play. There are still small nuances to each one, but compared to a year ago, prompts are much more portable.

        • By distalx 2025-12-03 20:09

          What tools or process do you use to optimize your prompts?

          • By amy_petrik 2025-12-03 23:09

            Usually I either use Grok to optimize a Mistral prompt, or Gemini to optimize a ChatGPT prompt. It's best to keep those pairs of AIs and not cross streams!

      • By barbazoo 2025-12-02 16:49

        > I guess they hope I forget to cancel.

        Business model of most subscription-based services.

        • By viking123 2025-12-03 7:43

          For me it's just that I am too lazy to switch away from my GPT subscription. I use it with Codex and it's very good for my use case, and the price, at least here in Asia, is not expensive at all for the Plus tier. The token allowance is so generous that I usually cannot even spend the weekly quota, although I use context smartly and know my codebase, so I can always point it to the right place right away.

          I feel like, at least for normies, if they are familiar with ChatGPT it might be hard to make them switch, especially if they are already subscribed.

        • By b3ing 2025-12-03 13:54

          I estimate at 10% of meetup runs like that

      • By acuozzo 2025-12-02 18:14

        > because they are interchangeable

        What is your use-case?

        Mine is: I use "Pro"/"Max"/"DeepThink" models to iterate on novel cross-domain applications of existing mathematics.

        My interaction is: I craft a detailed prompt in my editor, hand it off, come back 20-30 minutes later, review the reply, and then repeat if necessary.

        My experience is that they're all very, very different from one another.

        • By mrtksn 2025-12-02 18:49

          My use case is Google replacement: things that I can do by myself so I can verify, and things that are not important so I don’t have to verify.

          Sure, they produce different output, so sometimes I will run the same thing on a few different models when I’m not sure or not happy, but I don’t delegate the thinking part; I always give a direction in my prompts. I don’t see myself running 30-minute queries because I will never trust the output and will have to do all the work myself. Instead I like to go step by step together.

      • By giancarlostoro 2025-12-02 18:15

        Maybe give Perplexity a shot? It has Grok, ChatGPT, Gemini, and Kimi K2; I don't think it has Mistral, unfortunately.

        • By mrtksn 2025-12-02 18:51

          I like Perplexity actually but haven’t used it in some time. Maybe I should give it a go :)

          • By ecommerceguy 2025-12-02 23:58

            I use their browser, Comet, for finance-related research. Very nice. I use pretty much all of the main AIs - chat, deep, gem, claude - each has found a little niche use case that I'm sure will rotate at some point in an upgrade cycle. There are so many AIs I don't see the point in paying for one. I'm convinced they will need ads to survive.

            Excited to add Mistral to the rotation!

            • By giancarlostoro 2025-12-03 2:47

              Oh man, I use Comet nearly daily. I tried setting Perplexity as my new-tab page on other browsers and for some reason it's not the same. I mostly use it that boring way too.

        • By VHRanger 2025-12-03 2:57

          Kagi has Mistral as well

    • By druskacik 2025-12-02 17:06

      This is my experience as well. Mistral models may not be the best according to benchmarks, and I don't use them for personal chats or coding, but for simple tasks with a pre-defined scope (such as categorization, summarization, etc.) they are the option I choose. I use mistral-small with the batch API and it's probably the most cost-efficient option out there.

      • By leobg 2025-12-03 8:45

        Did you compare it to gemini-2.0-flash-lite?

        • By leobg 2025-12-03 11:01

          Answering my own question:

          Artificial Analysis ranks them close in terms of price (both 0.3 USD/1M tokens) and intelligence (27 / 29 for gemini/mistral), but ranks gemini-2.0-flash-lite higher in terms of speed (189 tokens/s vs. 130).

          So they should be interchangeable. Looking forward to testing this.

          [0] https://artificialanalysis.ai/?models=o3%2Cgemini-2-5-pro%2C...

        • By druskacik 2025-12-03 22:14

          I did some vibe-evals only and it seemed slightly worse for my use case, so I didn't change it.

    • By mbowcut2 2025-12-02 17:46

      It makes me wonder about the gaps in evaluating LLMs by benchmarks. There is almost certainly overfitting happening, which could degrade other use cases. "In practice" evaluation is what inspired the Chatbot Arena, right? But then people realized that Chatbot Arena over-prioritizes formatting, and maybe sycophancy(?). Makes you wonder what the best evaluation would be. We probably need lots more task-specific models. That seems to have been fruitful for improved coding.

      • By pants2 2025-12-02 18:02

        The best benchmark is one that you build for your use-case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use-cases but if you have something specific you're optimizing for there's probably a more obscure model that just does a better job.

        • By airstrike 2025-12-02 18:15

          If you and others have any insights to share on structuring that benchmark, I'm all ears.

          There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.

          The answer may be that it's so bespoke you have to hand-roll it every time, but my gut says there's a set of best practices that are generally applicable.

          • By pants2 2025-12-02 20:00

            Generally, the easiest (a minimal sketch follows the list):

            1. Sample a set of prompts / answers from historical usage.

            2. Run that through various frontier models again and if they don't agree on some answers, hand-pick what you're looking for.

            3. Test different models using OpenRouter and score each along cost / speed / accuracy dimensions against your test set.

            4. Analyze the results and pick the best, then prompt-optimize to make it even better. Repeat as needed.
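
            A minimal sketch of steps 1-3, assuming OpenRouter's OpenAI-compatible endpoint (the model IDs, test case, and substring scoring rule are all illustrative):

            ```
            # Run a saved test set through candidate models and score each one.
            from openai import OpenAI

            client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

            test_set = [{"prompt": "Categorize this ticket: 'refund not received'",
                         "expected": "billing"}]
            candidates = ["mistralai/mistral-large-2411", "deepseek/deepseek-chat"]

            for model in candidates:
                correct = 0
                for case in test_set:
                    reply = client.chat.completions.create(
                        model=model,
                        messages=[{"role": "user", "content": case["prompt"]}],
                    )
                    answer = reply.choices[0].message.content.lower()
                    correct += case["expected"] in answer  # crude substring scoring
                print(model, f"{correct}/{len(test_set)} correct")
            ```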

        • By dotancohen 2025-12-03 12:42

          How do you find and decide which obscure models to test? Do you manually review the model card for each new model on Hugging Face? Is there a better resource?

          • By pants2 2025-12-05 4:49

            Just grab the top ~30 models on OpenRouter[1] and test them all. If that's too expensive, make a sample 'screening' benchmark with just a few of the hardest problems to see if it's even worth the full benchmark.

            1. https://openrouter.ai/models?order=top-weekly&fmt=table

            • By dotancohen 2025-12-05 7:59

              Thank you! I'll see about building a test suite.

              Do you compare models' output subjectively, manually? Or do you have some objective measures? My use case would be to test diagnostic information summaries - the output is free text, not structured. The only way I can think to automate that would be with another LLM.

              Advice welcome!

              • By pants2 2025-12-05 16:53

                Yeah - things are easy when you can objectively score an output, otherwise as you said you'll probably need another LLM to score it. For summaries you can try to make that somewhat more objective, like length and "8/10 key points are covered in this summary."

                This is a real training method (like Group Relative Policy Optimization), so it's a legitimate approach.
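
                A hedged sketch of that LLM-as-judge idea (the judge model ID and rubric prompt are assumptions, not a prescribed setup):

                ```
                # Ask a strong model how many rubric points a summary covers.
                from openai import OpenAI

                client = OpenAI()

                def coverage_score(summary: str, key_points: list[str]) -> float:
                    rubric = "\n".join(f"- {p}" for p in key_points)
                    prompt = (f"Key points:\n{rubric}\n\nSummary:\n{summary}\n\n"
                              "How many key points does the summary cover? "
                              "Answer with a single integer.")
                    resp = client.chat.completions.create(
                        model="gpt-4o",  # any capable judge model works here
                        messages=[{"role": "user", "content": prompt}],
                    )
                    covered = int(resp.choices[0].message.content.strip())
                    return covered / len(key_points)  # e.g. 8/10 -> 0.8
                ```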

                • By dotancohen 2025-12-05 17:30

                  Thank you. I will google Group Relative Policy Optimization to learn about that and the other training methods. If you have any resources handy that I should be reading, that would be appreciated as well. Have a great weekend.

                  • By pants2 2025-12-05 19:44

                    Nothing off the top of my head! If you find anything good, let me know. GRPO is a training technique, likely not exactly what you'd do for benchmarking, but it's interesting to read about anyway. Glad I could help.

      • By Legend2440 2025-12-02 20:53

        I don’t think benchmark overfitting is as common as people think. Benchmark scores are highly correlated with the subjective “intelligence” of the model. So is pretraining loss.

        The only exception I can think of is models trained on synthetic data like Phi.

      • By pembrook 2025-12-02 19:19

        If the models from the big US labs are being overfit to benchmarks, then we also need to account for HN commenters overfitting positive evaluations to Chinese or European models based on their political biases (US big tech = default bad, anything European = default good).

        Also, we should be aware of people cynically playing into that bias to try to advertise their app, like OP who has managed to spam a link in the first line of a top comment on this popular front page article by telling the audience exactly what they want to hear ;)

        • By astrange 2025-12-03 18:16

          Americans have an opposing bias via the phenomenon of "safe edgy", where for obvious reasons they're uncomfortable with being biased towards anyone who looks like a US minority, and redirect all that energy towards being racist to the French. So it's all balanced.

    • By mentalgear 2025-12-02 17:28

      Thanks for sharing your use case of the Mistral models, which are indeed top-notch! I had a look at phrasing.app, and while it's a nice website, I found the copy of "Hand-crafted. Phrasing was designed & developed by humans, for humans." somewhat of a false virtue given your statements here about advanced LLM usage.

      • By barrell 2025-12-02 17:35

        I don't see the contention. I do not use llms in the design, development, copywriting, marketing, blogging, or any other aspect of the crafting of the application.

        I labor over every word, every button, every line of code, every blog post. I would say it is as hand-crafted as something digital can be.

        • By willlma 2025-12-07 5:13

          It's interesting. I've been tinkering with an article summarizing/highlighting browser extension, and realized that I don't want the end user to read AI-generated content, because it's not as high-quality as I'd hoped. But on the flip side, I'm loving having the AI write most of the code for me.

        • By basilgohar 2025-12-02 17:46

          I admire and respect this stance. I have been very AI-hesitant and while I'm using it more and more, I have spaces that I want to definitely keep human-only, as this is my preference. I'm glad to hear I'm not the only one like this.

          • By barrell 2025-12-02 18:06

            Thank you :) and you're definitely not the only one.

            Full transparency, the first backend version of phrasing was 'vibe-coded' (long before vibe coding was a thing). I didn't like the results, I didn't like the experience, I didn't feel good ethically, and I didn't like my own development.

            I rewrote the application (completely, from scratch: new repo, new language, new framework) and all of a sudden I liked the results, I loved the process, I had no moral qualms, and I improved leaps and bounds in all the areas I worked on.

            Automation has some amazing use cases (I am building an automation product at the end of the day) but so does doing hard things yourself.

            Although most important is just to enjoy what you do; or perhaps do something you can be proud of.

    • By metadat 2025-12-02 16:31

      Are you saying gpt-5 produces gibberish 15% of the time? Or are you comparing Mistral's gibberish production rate to gpt-5.1's complex-task failure rate?

      Does Mistral even have a Tool Use model? That would be awesome to have a new coder entrant beyond OpenAI, Anthropic, Grok, and Qwen.

      • By barrell 2025-12-02 16:47

        Yes. I spent about 3 days trying to optimize the prompt to get gpt-5 to not produce gibberish, to no avail. Completions took several minutes, had an above-50% timeout rate (with a 6-minute timeout, mind you), and after retrying it would still return gibberish about 15% of the time (12% on one task, 20% on another).

        I then tried multiple models, and they all failed in spectacular ways. Only Grok and Mistral had an acceptable success rate, although Grok did not follow the formatting instructions as well as Mistral.

        Phrasing is a language learning application, so the formatting is very complicated, with multiple languages and multiple scripts intertwined with markdown formatting. I do include dozens of examples in the prompts, but it's something many models struggle with.

        This was a few months ago, so to be fair, it's possible gpt-5.1 or gemini-3 or the new deepseek model may have caught up. I have not had the time or need to compare, as Mistral has been sufficient for my use cases.

        I mean, I'd love to get that 0.1% error rate down, but there have always been more pressing issues XD

        • By data-ottawa 2025-12-02 17:28

          With gpt5 did you try adjusting the reasoning level to "minimal"?

          I tried using it for a very small and quick summarization task that needed low latency and any level above that took several seconds to get a response. Using minimal brought that down significantly.

          Weirdly, gpt-5's reasoning levels don't map to the OpenAI API's reasoning-effort levels.
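
          For reference, a hedged sketch of setting that through the API (parameter names and supported values vary by model and SDK version; treat this as illustrative):

          ```
          # Illustrative only: request minimal reasoning effort to cut latency.
          from openai import OpenAI

          client = OpenAI()
          resp = client.chat.completions.create(
              model="gpt-5",
              reasoning_effort="minimal",  # assumed supported value; verify in the API docs
              messages=[{"role": "user", "content": "Summarize: ..."}],
          )
          print(resp.choices[0].message.content)
          ```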

          • By barrell 2025-12-02 18:25

            Reasoning was set to minimal and low (and I think I tried medium at some point). I do not believe the timeouts were due to the reasoning taking too long, although I never streamed the results. I think the model just fails often: it stops producing tokens and eventually the request times out.

        • By barbazoo 2025-12-02 16:50

          Hard to gauge what gibberish is without an example of the data and what you prompted the LLM with.

          • By barrell 2025-12-02 16:59

            If you wanted examples, you needed only ask :)

            These are screenshots from that week: https://x.com/barrelltech/status/1995900100174880806

            I'm not going to share the prompt because (1) it's very long (2) there were dozens of variations and (3) it seems like poor business practices to share the most indefensible part of your business online XD

            • By barbazoo 2025-12-02 17:45

              Surely reads like someone's brain transformed into a tree :)

              Impressive, I haven't seen that myself yet, I've only used 5 conversationally, not via API yet.

              • By barrell 2025-12-02 18:11

                Heh it's a quote from Archer FX (and admittedly a poor machine translation, it's a very old expression of mine).

                And yes, this only happens when I ask it to apply my formatting rules. If you let GPT format itself, I would be surprised if this ever happens.

            • By sandblast 2025-12-02 17:07

              XD XD

    • By acuozzo 2025-12-02 18:08

      I have a need to remove loose "signature" lines from the last 10% of a tremendous e-mail dataset. Based on your experience, how do you think mistral-3-medium-0525 would do?

      • By barrell 2025-12-02 18:18

        What's your acceptable error rate? Honestly ministral would probably be sufficient if you can tolerate a small failure rate. I feel like medium would be overkill.

        But I'm no expert. I can't say I've used mistral much outside of my own domain.

        • By acuozzo 2025-12-02 19:15

          I'd prefer for the error rate to be as close to 0% as possible under the strict requirement of having to use a local model. I have access to nodes with 8xH200, but I'd prefer to not tie those up with this task. I'd, instead, prefer to use a model I can run on an M2 Ultra.

          • By barrell 2025-12-02 19:50

            If I cannot tolerate a failure rate, I do not use LLMs (or any ML models).

            But in that case, the larger the better. If Mistral Medium can run on your M2 Ultra then it should be up to the task. It should edge out Ministral and be just shy of the biggest frontier models.

            But I wouldn’t trust even GPT-5 or Claude Opus or Gemini 3 Pro to get close to a zero percent error rate, and for a task such as this I would not expect Mistral Medium to outperform the big boys.

    • By mackross 2025-12-03 12:32

      Cool app. I couldn’t see a way to report an error in one of the default expressions.

  • By msp26 2025-12-02 16:33

    The new large model uses DeepseekV2 architecture. 0 mention on the page lol.

    It's a good thing that open source models use the best arch available. K2 does the same but at least mentions "Kimi K2 was designed to further scale up Moonlight, which employs an architecture similar to DeepSeek-V3".

    ---

    vllm/model_executor/models/mistral_large_3.py

    ```
    # Excerpt: Mistral Large 3 reuses vLLM's DeepSeek-V3 implementation.
    from vllm.model_executor.models.deepseek_v2 import DeepseekV3ForCausalLM

    class MistralLarge3ForCausalLM(DeepseekV3ForCausalLM):
        ...
    ```

    "Science has always thrived on openness and shared discovery." btw

    Okay I'll stop being snarky now and try the 14B model at home. Vision is good additional functionality on Large.

    • By Jackson__ 2025-12-02 21:01

      So they spent all of their R&D to copy deepseek, leaving none for the singular novel added feature: vision.

      To quote the hf page:

      >Behind vision-first models in multimodal tasks: Mistral Large 3 can lag behind models optimized for vision tasks and use cases.

      • By Ey7NFZ3P0nzAe 2025-12-02 21:12

        Well, behind "models", not "language models".

        Of course models made purely for image stuff will completely wipe the floor with it. Vision-language models are useful for their generalist capabilities.

    • By make3 2025-12-02 19:50

      Architecture differences, relative to vanilla transformers and between modern transformers, are a tiny part of what makes a model nowadays.

    • By halJordan 2025-12-02 21:43

      I don't think it's fair to demand everything be open and then get mad when that openness is used. It's an obsessive and harmful double standard.

  • By simonw 2025-12-02 17:41

    The 3B vision model runs in the browser (after a 3GB model download). There's a very cool demo of that here: https://huggingface.co/spaces/mistralai/Ministral_3B_WebGPU

    Pelicans are OK but not earth-shattering: https://simonwillison.net/2025/Dec/2/introducing-mistral-3/

    • By troyvit 2025-12-02 19:38

      I'm reading this post and wondering what kind of crazy accessibility tools one could make. I think it's a little off the rails but imagine a tool that describes a web video for a blind user as it happens, not just the speech, but the actual action.

    • By user_of_the_wek 2025-12-03 7:35

      > The image depicts and older man...

      Ouch
