Ollama Turbo

2025-08-05 18:46 · ollama.com

Get up and running with large language models.

  • Turbo speed

    Take the load of running models off your Mac, Windows or Linux computer, giving you performance back for your other apps.


Read the original article

Comments

  • By extr 2025-08-05 19:49 · 2 replies

    Nice release. Part of the problem right now with OSS models (at least for enterprise users) is the diversity of offerings in terms of:

    - Speed

    - Cost

    - Reliability

    - Feature Parity (eg: context caching)

    - Performance (What quant level is being used...really?)

    - Host region/data privacy guarantees

    - LTS

    And that's not even including the decision of what model you want to use!

    Realistically if you want to use an OSS model instead of the big 3, you're faced with evaluating models/providers across all these axes, which can require a fair amount of expertise to discern. You may even have to write your own custom evaluations. Meanwhile Anthropic/OAI/Google "just work" and you get what it says on the tin, to the best of their ability. Even if they're more expensive (and they're not that much more expensive), you are basically paying for the privilege of "we'll handle everything for you".

    I think until providers start standardizing OSS offerings, we're going to continue to exist in this in-between world where OSS models theoretically are at performance parity with closed source, but in practice aren't really even in the running for serious large scale deployments.

    • By coderatlarge 2025-08-05 21:48 · 2 replies

      True, but that ignores handing over all your prompt traffic without any real legal protections, as sama has pointed out:

      [1] https://californiarecorder.com/sam-altman-requires-ai-privil...

      • By I_am_tiberius 2025-08-06 1:32

        I wouldn't be surprised if those undeleted chats, or some data inferred from them, are part of the gpt-5 training data. Somehow I don't trust this sama guy at all.

      • By supermatt 2025-08-0522:015 reply

        > OpenAI confirmed it has been preserving deleted and non permanent person chat logs since mid-Might 2025 in response to a federal court docket order

        > The order, embedded under and issued on Might 13, 2025, by U.S. Justice of the Peace Decide Ona T. Wang

        Is this some meme where “may” is being replaced with “might”, or some word substitution gone awry? I don’t get it.

        • By SickOfItAll 2025-08-10 19:39

          Clearly the author wrote the article with multiple uses of "may" and then used find/replace to change to "might" without proofreading.

        • By wkat4242 2025-08-06 3:07

          Yeah, I noticed this too. Really weird for a professional publication.

        • By kekebo 2025-08-05 22:22

          :)) Apparently. I don't have a better guess. Well spotted

        • By beowulfey 2025-08-06 15:50

          auto correct gone awry

        • By mattmaroon 2025-08-06 12:38 · 1 reply

          Or May in another language?

          • By davidron 2025-08-08 1:36

            Or non native English speaker who pronounces "may" the same as "might" and didn't realize the difference?

            It is maybe not coincidental that "may" and "might" mean nearly the same thing which bolsters the case for auto correct gone awry.

    • By wkat4242 2025-08-06 3:05

      Gpt-oss comes only in a 4.5-bit quant. This is the native model, so there's no fp16 original.
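
      For context on that quant: MXFP4 stores weights as 4-bit E2M1 values, with each block of 32 values sharing one power-of-two scale, which is why the effective bits per weight land a bit above 4. A minimal illustrative decoder of the format (a sketch, not Ollama's actual kernel):

```python
# Illustrative MXFP4-style decoder (OCP Microscaling format sketch).
# Each 4-bit E2M1 code is 1 sign bit + 3 bits picking a magnitude;
# a block of codes shares a single power-of-two (E8M0) scale.

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 code."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]

def decode_block(codes, scale_exp: int):
    """Decode a block of FP4 codes sharing one power-of-two scale."""
    scale = 2.0 ** scale_exp
    return [decode_fp4(c) * scale for c in codes]

# Example: codes for +1.5 and -1.0 with a shared scale of 2^-1
print(decode_block([0b0011, 0b1010], -1))  # [0.75, -0.5]
```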

  • By jnmandal 2025-08-05 20:11 · 4 replies

    I see a lot of hate for ollama doing this kind of thing, but they also remain one of the easiest-to-use solutions for developing and testing against a model locally.

    Sure, llama.cpp is the real thing, ollama is a wrapper... I would never want to use something like ollama in a production setting. But if I want to quickly get someone less technical up to speed to develop an LLM-enabled system and run qwen or w/e locally, well then it's pretty nice that they have a GUI and a .dmg to install.

    • By mchiang 2025-08-05 20:21 · 3 replies

      Thanks for the kind words.

      Since the new multimodal engine, Ollama has moved off of llama.cpp as a wrapper. We do continue to use the GGML library, and ask hardware partners to help optimize it.

      Ollama might look like a toy, and like something trivial to build. I can say that, to keep its simplicity, we go through a great deal of struggle to make it work with the experience we want.

      Simplicity is often overlooked, but we want to build the world we want to see.

      • By dcreater 2025-08-05 22:26 · 3 replies

        But Ollama is a toy; it's meaningful for hobbyists and individuals to use locally, like myself. Why would it be the right choice for anything more? AWS, vLLM, SGLang, etc. would be the solutions for enterprise.

        I knew a startup that deployed ollama on a customer's premises, and when I asked them why, they had absolutely no good reason. Likely they did it because it was easy. That's not the "easy to use" case you want to solve for.

        • By mchiang 2025-08-06 6:54

          I can say, after trying many inference tools following the launch, that many do not have the models implemented well, especially OpenAI's harmony format.

          Why does this matter? For this specific release, we benchmarked against OpenAI’s reference implementation to make sure Ollama is on par. We also spent a significant amount of time getting harmony implemented the way intended.

          I know vLLM also worked hard to implement against the reference and have shared their benchmarks publicly.
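
          For a sense of what harmony looks like: messages are wrapped in special tokens, and assistant output is routed to named channels (reasoning vs. the user-visible answer). A simplified sketch using a hypothetical helper (see OpenAI's harmony spec for the full format):

```python
# Simplified sketch of OpenAI's harmony chat format for gpt-oss.
# Special tokens delimit each message; assistant output carries a
# channel such as "analysis" (reasoning) or "final" (user-visible).

def harmony_message(role, content, channel=None):
    """Render one harmony message; `channel` applies to assistant turns."""
    header = role if channel is None else f"{role}<|channel|>{channel}"
    return f"<|start|>{header}<|message|>{content}<|end|>"

prompt = (
    harmony_message("system", "You are a helpful assistant.")
    + harmony_message("user", "What is 2+2?")
)
reply = harmony_message("assistant", "4", channel="final")
print(reply)  # <|start|>assistant<|channel|>final<|message|>4<|end|>
```

          Getting details like these channels right is what determines whether tool calls and reasoning traces parse correctly downstream.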

        • By jnmandal 2025-08-06 2:26 · 1 reply

          Honestly, I think it just depends. A few hours ago I wrote that I would never want it in a production setting, but actually, if I were standing something up myself, I could just download headless ollama and know it would work. Hey, that would most likely also be fine. Maybe later on I'd revisit it from a devops perspective, refactor the deployment methodology/stack, etc. Maybe I'd benchmark it and realize it's fine actually. Sometimes you just need to make your whole system work.

          We can obviously disagree with their priorities, their roadmap, the fact that the client isn't FOSS (I wish it was!), etc, but no one can say that ollama doesn't work. It works. And like mchiang said above: it's dead simple, on purpose.

          • By dcreater 2025-08-06 4:03 · 2 replies

            But it's effectively just as easy to do the same with llama.cpp, vllm, or modular...

            (any differences are small enough that they either shouldn't cause the human much work or can very easily be delegated to AI)

            • By evilduck 2025-08-06 14:41

              Llama.cpp is not really that easy unless you're supported by their prebuilt binaries. Go to the llama.cpp GitHub page and find a prebuilt CUDA-enabled release for a Fedora-based Linux distro. Oh, there isn't one, you say? Welcome to losing an hour or more of your time.

              Then you want to swap models on the fly. llama-swap, you say? You now get to learn a new custom YAML-based config file syntax that does basically nothing the Ollama model file doesn't already do, so that you can ultimately... have the same experience as Ollama, but now you've lost hours just to get back to square one.

              Then you need it to start and be ready on system reboot? Great, now you get to write some systemd services, move stuff into system-level folders, create some groups and users, and poof, there goes another hour of your time.
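
              To illustrate that last step, a minimal systemd unit for llama.cpp's llama-server might look like the following sketch (binary path, model path, user, and flags are all assumptions):

```ini
# /etc/systemd/system/llama-server.service -- illustrative sketch only
[Unit]
Description=llama.cpp server
After=network-online.target

[Service]
User=llama
ExecStart=/opt/llama.cpp/build/bin/llama-server \
    --model /opt/models/model.gguf --host 127.0.0.1 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

              Enabled with something like `systemctl enable --now llama-server`, plus the users, groups, and folder permissions mentioned above.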

            • By jnmandal 2025-08-06 15:11

              Sure, but if some of the development team is using ollama locally b/c it was super easy to install, maybe I don't want to worry about maintaining a separate build chain for my prod env. Many startups are just wrapping or enabling LLMs and just need a running server. Who are we to say what is the right use of their time and effort?

      • By leopoldj 2025-08-06 15:07 · 1 reply

        > Ollama has moved off of llama.cpp as a wrapper. We do continue to use the GGML library

        Where can I learn more about this? llama.cpp is an inference application built using the ggml library. Does this mean Ollama now has its own code for what llama.cpp does?

      • By buyucu 2025-08-06 6:56 · 2 replies

        This kind of gaslighting is exactly why I stopped using Ollama.

        GGML library is llama.cpp. They are one and the same.

        Ollama made sense when llama.cpp was hard to use. Ollama does not have a value proposition anymore.

        • By mchiang 2025-08-06 7:17 · 2 replies

          It’s a different repo. https://github.com/ggml-org/ggml

          The models are implemented by Ollama https://github.com/ollama/ollama/tree/main/model/models

          I can say as a fact, for the gpt-oss model, we also implemented our own MXFP4 kernel. Benchmarked against the reference implementations to make sure Ollama is on par. We implemented harmony and tested it. This should significantly impact tool calling capability.

          I'm not sure if I'm feeding the trolls here. We really love what we do, and I hope it shows in our product, in Ollama's design, and in our voice to our community.

          You don’t have to like Ollama. That’s subjective to your taste. As a maintainer, I certainly hope to have you as a user one day. If we don’t meet your needs and you want to use an alternative project, that’s totally cool too. It’s the power of having a choice.

          • By mark_l_watson 2025-08-06 15:05

            Hello, thanks for answering questions here.

            Is there a schedule for adding additional models to the Turbo plan, in addition to gpt-oss 20/120b? I wanted to try your $20/month Turbo plan, but I would like to be able to experiment with a few other large models.

          • By buyucu 2025-08-07 12:04

            This is exactly what I mean by gaslighting.

            GGML is llama.cpp. It is developed by the same people as llama.cpp and powers everything llama.cpp does. You must know that. The fact that you are ignoring it is very dishonest.

        • By scosman 2025-08-06 14:09

          > GGML library is llama.cpp. They are one and the same.

          Nope…

    • By steren 2025-08-05 21:01 · 3 replies

      > I would never want to use something like ollama in a production setting.

      We benchmarked vLLM and Ollama on both startup time and tokens per second. Ollama comes out on top. We hope to be able to publish these results soon.
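
      As a sketch of how a tokens-per-second number can be computed from the timing fields Ollama's /api/generate returns (the sample response values below are made up):

```python
# Sketch: throughput from an Ollama /api/generate final response, which
# reports eval_count (generated tokens) and eval_duration (nanoseconds).

def tokens_per_second(response: dict) -> float:
    """Generated tokens divided by generation time in seconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

# Hypothetical response fragment for illustration:
sample = {"eval_count": 240, "eval_duration": 2_000_000_000}  # 2 s
print(tokens_per_second(sample))  # 120.0
```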

      • By ekianjo 2025-08-05 21:40

        you need to benchmark against llama.cpp as well.

      • By apitman 2025-08-05 22:03 · 1 reply

        Did you test multi-user cases?

        • By jasonjmcghee 2025-08-06 7:48

          Assuming this is equivalent to parallel sessions, I would hope so; this is basically the entire point of vLLM.

      • By sbinnee 2025-08-06 9:27

        vLLM and ollama assume different settings and hardware. vLLM, backed by paged attention, expects a lot of requests from multiple users, whereas ollama is usually for a single user on a local machine.
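
        The difference can be sketched as how many requests are in flight at once; here a stand-in generate() function (a placeholder, not a real client) is called from many threads the way a multi-user server would be exercised:

```python
# Sketch: a multi-user workload issues many concurrent sessions, the
# case paged-attention servers like vLLM target; a single local user
# mostly sends one request at a time. generate() is a placeholder for
# a real HTTP call to an inference server.
from concurrent.futures import ThreadPoolExecutor

def generate(prompt):
    return f"response to: {prompt}"  # placeholder, no real inference

prompts = [f"prompt {i}" for i in range(16)]

# Eight sessions in flight at once, as a multi-user benchmark would do
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(generate, prompts))

print(len(results))  # 16
```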

    • By romperstomper 2025-08-07 4:01

      It is weird, but when I tried the new gpt-oss:20b model locally, llama.cpp just failed instantly for me. At the same time, under ollama it worked (very slowly, but anyway). I didn't figure out how to deal with llama.cpp, but ollama is definitely doing something under the hood to make models work.

    • By miki123211 2025-08-06 7:11

      > I would never want to use something like ollama in a production setting

      If you can't get access to "real" datacenter GPUs for any reason and essentially do desktop, clientside deploys, it's your best bet.

      It's not a common scenario, but a desktop with a 4090 or two is all you can get in some organizations.

  • By moralestapia 2025-08-05 19:35 · 4 replies

    Ollama is great but I feel like Georgi Gerganov deserves way more credit for llama.cpp.

    He (almost) single-handedly brought LLMs to the masses.

    With the latest news of some AI engineers' compensation reaching up to a billion dollars, feels a bit unfair that Georgi is not getting a much larger slice of the pie.

    • By mrs6969 2025-08-05 19:48 · 2 replies

      Agreed. Ollama itself is kind of a wrapper around llamacpp anyway. Feels like the real guy is not included in the process.

      Now I am going to go and write a wrapper around llamacpp that is fully open source and truly local.

      How can I trust ollama not to sell my data?

      • By Patrick_Devine 2025-08-05 19:54

        Ollama only uses llamacpp for running legacy models. gpt-oss runs entirely in the ollama engine.

        You don't need to use Turbo mode; it's just there for people who don't have capable enough GPUs.

      • By rafram 2025-08-05 19:54 · 1 reply

        Ollama is not a wrapper around llama.cpp anymore, at least for multimodal models (not sure about others). They have their own engine: https://ollama.com/blog/multimodal-models

        • By iphone_elegance 2025-08-05 23:19

          looks like the backend is ggml, am I missing something? same diff

    • By benreesman 2025-08-09 4:10

      `ggerganov` is one of the most under-rated and under-appreciated hackers maybe ever. His name belongs next to like Carmack and other people who made a new thing happen on PCs. And don't forget the shout out to `TheBloke` who like single-handedly bootstrapped the GGUF ecosystem of useful model quants (I think he had a grant from pmarca or something like that, so props to that too).

    • By freedomben 2025-08-05 19:41 · 2 replies

      Is Georgi landing any of those big-time money jobs? I could see a conflict of interest given his involvement with llama.cpp, but I would think he'd be well positioned for something like that.

      • By apwell23 2025-08-05 19:55

        https://ggml.ai/

        > ggml.ai is a company founded by Georgi Gerganov to support the development of ggml. Nat Friedman and Daniel Gross provided the pre-seed funding.

      • By moralestapia 2025-08-05 19:48 · 1 reply

        (This is mere speculation)

        I think he's happy doing his own thing.

        But then, if someone came in with a billion ... who wouldn't give it a thought?

        • By webdevver 2025-08-05 19:50

          Really, a billion bucks is far too much; that is beyond the curve.

          $50M, now that's just perfect: you're retired, not burdened with a huge responsibility.

    • By am17an 2025-08-06 2:11 · 1 reply

      Seriously, people are astroturfing this thread by saying ollama has a new engine. It is literally the same engine that llama.cpp uses, and that georgi and slaren maintain! VC funding will make people so dishonest and just plain grifters.

      • By guipsp 2025-08-06 17:58

        No one is astroturfing. You cannot run any model with just GGML; it's a tensor library. Yes, GGML adds value, but I don't think it's unfair to say that ollama adds value too.

HackerNews