LM Studio 0.4

2026-01-28 18:23 · lmstudio.ai

Server deployment, parallel requests with continuous batching, new REST API endpoint, and refreshed application UI


Comments

  • By syntaxing, 2026-01-28 19:15 (4 replies)

    I’m really excited for llmster and to try it out. It’s essentially what I want from Ollama. Ollama has deviated so much from its original core principles; it has been broken and slow to update model support. There’s this “vendor sync” (essentially updating ggml) that I’ve been waiting on for weeks.

    • By Imustaskforhelp, 2026-01-29 13:09 (1 reply)

      LM Studio is great, but it’s still not open source. I honestly wish something better than Ollama could be created, similar to LM Studio (at least its new CLI part, from what I can tell), as an open source alternative.

      I think I am fairly technical, but I still prefer how simple Ollama is. I know all the complaints about Ollama; I am really just wishing for a better alternative, for the most part.

      Maybe just a direct layer on top of vllm or llama.cpp itself?

      • By embedding-shape, 2026-01-29 14:35 (1 reply)

        > Maybe just a direct layer on top of vllm

        My dream would be something like vLLM, but without all the Python mess, packaged as a single binary that has both an HTTP server and a desktop GUI, and can browse/download models. Llama.cpp is like 70% there, but there’s a large performance difference between llama.cpp and vLLM for the models I use.

        • By Imustaskforhelp, 2026-01-29 21:39

          > My dream would be something like vLLM, but without all the Python mess, packaged as a single binary that has both an HTTP server and a desktop GUI, and can browse/download models. Llama.cpp is like 70% there, but there’s a large performance difference between llama.cpp and vLLM for the models I use.

          To be honest, I kept seeing your comment, and after 6 hours something suddenly clicked.

          I had seen this project on Reddit once: https://github.com/GeeeekExplorer/nano-vllm

          According to its README, it’s almost as fast as vLLM itself (maybe even faster?), but unfortunately it’s written in Python too.

          But the good news is that the codebase is much smaller. Let me paste some things from its README:

               • Fast offline inference - comparable inference speeds to vLLM
               • Readable codebase - clean implementation in ~1,200 lines of Python code
               • Optimization suite - prefix caching, tensor parallelism, Torch compilation, CUDA graphs, etc.
          
          
          Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s)
          vLLM             | 133,966       | 98.37    | 1361.84
          Nano-vLLM        | 133,966       | 93.41    | 1434.13
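As a quick sanity check on the README numbers quoted above, throughput is just output tokens divided by wall-clock time. This is a back-of-the-envelope consistency check, not a benchmark:

```python
# Recompute throughput (tokens/s) from the quoted benchmark rows.
runs = {
    "vLLM": (133_966, 98.37),       # (output tokens, time in seconds)
    "Nano-vLLM": (133_966, 93.41),
}

for engine, (tokens, seconds) in runs.items():
    print(f"{engine}: {tokens / seconds:.2f} tokens/s")
```

Both come out within a fraction of a token/s of the quoted 1361.84 and 1434.13, so the table is internally consistent (up to rounding).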

          So I guess I am pretty sure you could one-agent-one-human it from Python to Rust/Go! It could be an open project.

          Also, speaking of oaoh (as I have started calling it), a bit offtopic, but my Go port ran into multiple issues when I tried to make it work today. I do feel like Rust was a good choice of language, because quite frankly the AI agent, instead of doing things with its own hands, really ends up wanting to use the Fyne library. The best success I had going against Fyne was in Kimi’s computer use, where I got a very, very simple (text-only, nothing else) PNG-file-esque thing working.

          If you are interested, emsh: given that your oaoh project is really high quality, I am quite frankly curious whether it still required human intervention, or whether an AI could port it by itself. I have mixed feelings about it.

          Honestly, it’s an open challenge to everybody. I am just really interested in learning something about how LLMs work, and in taking some lesson from this whole thing, I guess.

          Still trying to create the Go port as we speak, haha xD.

    • By PlatoIsADisease, 2026-01-28 21:09 (3 replies)

      What was the original core principle of ollama?

      I had used oobabooga back in the day and found ollama unnecessary.

      • By embedding-shape, 2026-01-29 11:15 (1 reply)

        > What was the original core principle of ollama?

        One decision that was/is very integral to their architecture is trying to copy how Docker handled registries and storage of blobs. Docker images have layers, so the registry could store one layer that is reused across multiple images, as one example.

        Ollama did this too, but I’m unsure why. I know the author used to work at Docker, but almost no data from weights can be shared in that way. So instead of just storing "$model-name.safetensors/.gguf" on disk, Ollama splits it up into blobs, has its own index, and so on, for seemingly no gain except making it impossible to share weights between multiple applications.

        I guess, business-wise, it made it easier for them to push people toward their "cloud models" so they earn money, because that’s just another registry the local client connects to. But it also means Ollama isn’t just about running local models anymore; that doesn’t make them money, so all their focus is now on their cloud instead.

        At least as an LM Studio, llama.cpp, and vLLM user, I can have one directory with weights shared between all of them (granted, the weight format has to work in all of them). If I want to use Ollama, it of course can’t use that same directory and will by default store things its own way.
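To make the blob-splitting complaint concrete, here is a minimal sketch of Docker-style content-addressed storage in Python. The directory layout and naming are hypothetical, illustrating the idea rather than Ollama's actual on-disk format:

```python
import hashlib
import os


def store_as_blob(store_dir: str, weights_path: str) -> str:
    """Move a weights file into a content-addressed blob store.

    The file is renamed to its SHA-256 digest, so a separate index is
    needed to map human-readable model names back to blobs.
    (Hypothetical layout for illustration, not Ollama's real one.)
    """
    with open(weights_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.makedirs(store_dir, exist_ok=True)
    blob_path = os.path.join(store_dir, f"sha256-{digest}")
    os.replace(weights_path, blob_path)  # the original filename is gone
    return blob_path
```

Identical files dedupe for free (same digest, same blob), but any tool that expects a plain `model.gguf` path can no longer find the weights without going through the index, which is exactly the sharing problem described above.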

        • By plagiarist, 2026-01-29 14:48 (1 reply)

          I was looking into which local inference software to use and also found this behavior with models to be onerous.

          What I want is to have a directory with models and bind-mount it read-only into inference containers. But Ollama would force me to either prime the pump by importing with Modelfiles (where do I even get these?) every time I start the container, or store their specific version of the files.

          Trying out vLLM and llama.cpp was my next step in this; I’m glad to hear you are able to share a directory between them.

      • By d0mine, 2026-01-29 19:14

        Ollama vs. llama.cpp is like Docker vs. FreeBSD jails, Dropbox vs. rsync, Jujutsu vs. Git, etc.

      • By fud101, 2026-01-29 6:58 (1 reply)

        >What was the original core principle of ollama?

        Nothing; it was always going to be a rug pull. They leeched off llama.cpp.

        • By garyfirestorm, 2026-01-29 13:02 (2 replies)

          Everyone seems to be missing an important piece here. Ollama is/was a one-click solution for a non-technical person to launch a local model. It doesn’t need a lot of configuration: it detects an Nvidia GPU and starts model inference with a single command. The core principle is that your grandmother should be able to launch a local AI model without needing to install 100 dependencies.

  • By tarruda, 2026-01-29 11:41 (3 replies)

    These days I don't feel the need to use anything other than llama.cpp server as it has a pretty good web UI and router mode for switching models.

    • By roger_, 2026-01-29 13:55

      MLX support on Macs was the main reason for me.

    • By embedding-shape, 2026-01-29 14:33

      I mostly use LM Studio for browsing and downloading models and testing them out quickly, but actually integrating them always happens with either llama.cpp or vLLM. Curious to try out their new CLI, though, and see if it adds any extra benefits on top of llama.cpp.

    • By mycall, 2026-01-29 14:35 (2 replies)

      Concurrency is an important use case when running multiple agents. vLLM can squeeze performance out of your GB10 or GPU that you wouldn't get otherwise.

      • By embedding-shape, 2026-01-29 14:39

        Also, they’ve just spent more time optimizing vLLM than the llama.cpp people have, even when you run just one inference call at a time. The best features are obviously the concurrency and shared cache, though. On the other hand, new architectures are usually available sooner in llama.cpp than in vLLM.

        Both have their places and are complementary, rather than competitors :)
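Since the headline feature of this release is parallel requests with continuous batching, here is a toy scheduler comparison showing why it helps the multi-agent case. This is a deliberately simplified model (one token per step, no prefill or KV-cache accounting), not vLLM's or LM Studio's actual scheduler:

```python
def static_batching(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching(lengths, batch_size):
    """Slot-based: a finished request's slot is refilled on the next step."""
    queue = list(lengths)
    slots = [queue.pop(0) for _ in range(min(batch_size, len(queue)))]
    steps = 0
    while slots:
        steps += 1
        slots = [s - 1 for s in slots if s > 1]  # finished requests leave
        while queue and len(slots) < batch_size:
            slots.append(queue.pop(0))           # refill freed slots
    return steps


requests = [10, 2, 2, 2]  # tokens each request needs to generate
print("static:", static_batching(requests, batch_size=2))       # 12 steps
print("continuous:", continuous_batching(requests, batch_size=2))  # 10 steps
```

With two slots, the static scheduler makes the short requests wait on the 10-token one (12 steps total), while the continuous scheduler refills a freed slot immediately (10 steps); the gap grows as request lengths get more uneven.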

      • By tarruda, 2026-01-29 16:32 (1 reply)

        I’m only interested in the local, single-user use case. Plus, I use a Mac Studio for inference, so vLLM is not an option for me.

  • By minimaxir, 2026-01-28 19:06 (1 reply)

    LMStudio introducing a command line interface makes things come full circle.

    • By Helithumper, 2026-01-28 19:24 (1 reply)

      For context, LM Studio has had a CLI for a while; it just required the desktop app to be open already. This change means you can run LM Studio properly headless, not just from a terminal while the desktop app is open.

      `lms chat` has existed; `lms daemon up` / "llmster" is the new command.

      • By embedding-shape, 2026-01-28 19:48

        > This change means you can run LM Studio properly headless, not just from a terminal while the desktop app is open

        Ah, this is great; I’ve been waiting for this! I naively built some tooling on top of the desktop app’s API after seeing they had a CLI. Then, once I wanted to deploy and run it on a server, I was very confused to find that the desktop app is what installs the CLI, and the CLI requires the desktop app to be running.
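For anyone building similar tooling: LM Studio's local server exposes an OpenAI-compatible chat completions endpoint (by default at http://localhost:1234/v1). A minimal sketch of the request shape, with a placeholder model name:

```python
import json


def chat_payload(model: str, prompt: str, temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat completions request body."""
    return {
        "model": model,  # placeholder id; list real ones via GET /v1/models
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


payload = chat_payload("your-model-id", "Say hello in one sentence.")
print(json.dumps(payload, indent=2))
# POST this to http://localhost:1234/v1/chat/completions while either the
# desktop app's server or the new headless daemon is running.
```

The same payload works against the desktop app's server and the headless daemon, which is what makes tooling like the commenter's portable between the two.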

        Great that they finally got it working fully headless now :)

HackerNews