
Detect your hardware and find out which AI models you can run locally. GPU, CPU, and RAM analysis in your browser.
I have spent a HUGE amount of time over the last two years experimenting with local models.
A few lessons learned:
1. Small models like the new qwen3.5:9b can be fantastic for local tool use, information extraction, and many other embedded applications.
2. For coding tools, just use Google Antigravity and gemini-cli, or Anthropic Claude, or...
Now to be clear, I have spent perhaps 100 hours in the last year configuring local models for coding using Emacs, Claude Code (configured for local), etc. However, I am retired and this time was a lot of fun for me: lots of effort trying to maximize local-only results. I don't recommend it for others.
I do recommend getting very good at using embedded local models in small practical applications. Sweet spot.
I'd love to know how you fit smaller models into your workflow. I have an M4 MacBook Pro w/ 128GB RAM, and while I have toyed with some models via ollama, I haven't really found a nice workflow for them yet.
It really depends on the tasks you have to perform. I am using specialized OCR models running locally to extract page layout information and text from scanned legal documents. The quality isn't perfect, but it is really good compared to desktop/server OCR software that I formerly used that cost hundreds or thousands of dollars for a license. If you have similar needs and the time to try just one model, start with GLM-OCR.
If you want a general knowledge model for answering questions or a coding agent, nothing you can run on your MacBook will come close to the frontier models. It's going to be an exercise in frustration if you try to use local models that way. But there are a lot of useful applications for local-sized models when it comes to describing and categorizing unstructured data.
I didn’t realize that you can get 128GB of memory in a notebook; that is impressive!
Most workstation-class laptops (e.g. Lenovo P-series, Dell Precision) have 4 DIMM slots and you can get them with 256 GB (at least before the current RAM shortages).
There's also the Ryzen AI Max+ 395 that has 128GB unified in laptop form factor.
Only Apple has dynamic allocation of unified memory, though.
I use Raycast and connect it to LM Studio to run text cleanup and summaries often. The models are small enough that I keep them in memory more often than not.
What about running e.g. Qwen3.5 128B on a rented RTX Pro 6000?
I've been really interested in the difference between 3.5 9b and 14b for information extraction. Is there a discernible difference in quality or capability?
What kind of hardware did you use? I suppose that an 8GB gaming GPU and a Mac Pro with 512 GB unified RAM give quite different results, both formally being local.
A Mac Pro with 512 GB unified RAM does not exist.
This seems to be estimating based on memory bandwidth / size of model, which is a really good estimate for dense models, but MoE models like GPT-OSS-20B don't involve the entire model for every token, so they can produce more tokens/second on the same hardware. GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.
(In terms of intelligence, they tend to score similarly to a dense model that's as big as the geometric mean of the full model size and the active parameters, i.e. for GPT-OSS-20B, it's roughly as smart as a sqrt(20b*3.6b) ≈ 8.5b dense model, but produces tokens 2x faster.)
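To make that concrete, here's the back-of-the-envelope version of the estimate I mean (a sketch; the 273 GB/s bandwidth is a placeholder, swap in your own hardware's number, and bytes_per_param depends on your quant):

```python
import math

def estimate_moe(total_params_b, active_params_b, bandwidth_gbs, bytes_per_param=2):
    """Decode speed is roughly memory-bandwidth-bound, so tok/s is
    approximated as bandwidth / bytes read per token -- which for a MoE
    is the *active* parameters, not the total."""
    tok_per_s = bandwidth_gbs / (active_params_b * bytes_per_param)
    # Rule of thumb: quality comparable to a dense model at the geometric
    # mean of total and active parameter counts.
    dense_equiv_b = math.sqrt(total_params_b * active_params_b)
    return tok_per_s, dense_equiv_b

# GPT-OSS-20B: ~20B total, ~3.6B active, assuming 2-byte weights.
tok_s, dense_b = estimate_moe(20, 3.6, 273)
print(f"~{tok_s:.0f} tok/s decode, quality ~ {dense_b:.1f}B dense")
```

Note the tok/s line divides by active params only; plugging the full 20B into that denominator is exactly the mistake the calculator seems to make.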
Yeah, I looked up some models I have actually run locally on my Strix Halo laptop, and it's saying I should have much lower performance than I actually get on models I've tested.
For MoE models, it should be using the active parameters in memory bandwidth computation, not the total parameters.
I'm guessing this is also calculating based on the full context size that the model supports, but depending on your use case that will be misleading. Even on a small consumer card with Qwen 3 30B-A3B you probably don't need 128K context depending on what you're doing, so a smaller context and some tensor overrides will help. llama.cpp's llama-fit-params is helpful in those cases.
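For a rough sense of how much the context assumption matters, here's a back-of-the-envelope KV-cache estimate (a sketch; the layer/head numbers below are rough placeholders for a GQA model of this class, not Qwen's exact config):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV-cache size: two tensors (K and V) per layer,
    each n_kv_heads * head_dim wide, one entry per context position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Placeholder architecture: 48 layers, 4 KV heads (GQA), head_dim 128, fp16 cache.
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(48, 4, 128, ctx):.1f} GB")
```

At these numbers the cache goes from under 1 GB at 8K context to roughly 13 GB at 128K, which can be the difference between fitting and not fitting on a consumer card.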
The docs page addresses this:
> A Mixture of Experts model splits its parameters into groups called "experts." On each token, only a few experts are active — for example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token. This means you get the quality of a larger model with the speed of a smaller one. The tradeoff: the full model still needs to fit in memory, even though only part of it runs at inference time.
> A dense model activates all its parameters for every token — what you see is what you get. A MoE model has more total parameters but only uses a subset per token. Dense models are simpler and more predictable in terms of memory/speed. MoE models can punch above their weight in quality but need more VRAM than their active parameter count suggests.
It discusses it, and they have data showing that they know the number of active parameters on an MoE model, but they don't seem to use that in their calculation. It gives me answers far lower than my real-world usage on my setup; its calculation lines up fairly well with what I'd get if I were trying to run a dense model of that size. Or, if I increase my memory bandwidth in the calculator by a factor of 10 or so, which is the ratio between total and active parameters in the model, I get results that are much closer to real-world usage.
While your remark is valid, there are two small inaccuracies here:
> GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.
First, the token generation speed is going to be comparable, but not the prefill speed (context processing is going to be much slower on a big MoE than on a small dense model).
Second, without speculative decoding, it is correct to say that a small dense model and a bigger MoE with the same number of active parameters are going to be roughly as fast. But if you use a small dense model you will see token generation performance improvements with speculative decoding (up to 3x the speed), whereas you probably won't gain much from speculative decoding on a MoE model (because two consecutive tokens won't trigger the same “experts”, so you'd need to load more weights into the compute units, using more bandwidth).
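If it helps to see why: here's a toy, greedy sketch of speculative decoding (the "models" below are stand-in counting functions, not real LLMs). The draft proposes k tokens cheaply, the target verifies them in one pass; a dense target with a well-matched draft accepts most proposals and emits several tokens per verification pass:

```python
def speculative_decode_step(target_next, draft_next, prefix, k=4):
    """One greedy speculative-decoding step.
    target_next / draft_next map a token sequence to the next token
    (greedy-argmax stand-ins for real models). Returns >= 1 new tokens."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    ctx, proposal = list(prefix), []
    for _ in range(k):
        proposal.append(draft_next(ctx))
        ctx.append(proposal[-1])

    # 2. Target verifies; in a real engine this is one batched forward
    #    pass over all k positions instead of k sequential passes.
    ctx, accepted = list(prefix), []
    for t in proposal:
        expected = target_next(ctx)
        if t != expected:
            accepted.append(expected)  # mismatch: keep target's token, stop
            break
        accepted.append(t)             # draft agreed: token comes for free
        ctx.append(t)
    else:
        accepted.append(target_next(ctx))  # all k accepted: one bonus token
    return accepted

# Toy demo: both "models" just count upward, so every proposal is accepted
# and each step yields k + 1 = 5 tokens per target pass instead of 1.
count = lambda seq: seq[-1] + 1 if seq else 0
out = [0]
while len(out) < 10:
    out += speculative_decode_step(count, count, out)
print(out)
```

On a MoE target, that single verification pass still has to load whichever experts each of the k tokens routes to, so the bandwidth saving largely evaporates.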
So, this is all true, but this calculation isn't that nuanced. It's trying to get you into a ballpark range, and if I put in my specs (my machine isn't in their hardware list), the results are fairly close to my real experience once I compensate for it calculating on total params instead of active.
So by doing so, this calculator is telling you that you should be running entirely dense models, and sparse MoE models that may be both faster and higher quality are not recommended.
I agree, and I even started my response expressing my agreement with the whole point.
But since this is a tech forum, I assumed some people would be interested in the correction of the details that were wrong.
This (and llmfit) are great attempts, but I've been generally frustrated by how hard it feels to find any sort of guidance on what I would expect to be the most straightforward/common question:
"What is the highest-quality model that I can run on my hardware, with tok/s greater than <x>, and context limit greater than <y>"
(My personal approach has just devolved into guess-and-check, which is time-consuming.) When using TFA/llmfit, I am immediately skeptical because I already know that Qwen 3.5 27B Q6 @ 100k context works great on my machine, but it's buried behind relatively obsolete suggestions like the Qwen 2.5 series.
I'm assuming this is because the tok/s is much higher, but I don't really get much marginal utility out of tok/s speeds beyond ~50 t/s, and there's no way to sort results by quality.
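Concretely, the query I want is just a filter-then-argmax over a model catalog. A sketch of it, with entirely made-up catalog entries and quality scores (the "quality" column is of course the hard part):

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    quality: float      # placeholder score from whatever benchmark you trust
    weights_gb: float   # quantized weight size
    active_gb: float    # bytes read per token (active params x quant width)
    kv_gb_per_8k: float # KV-cache cost per 8K tokens of context

def best_model(catalog, vram_gb, bandwidth_gbs, min_tok_s, min_ctx):
    """Highest-quality model that fits in memory at min_ctx context and
    decodes at >= min_tok_s, using a bandwidth-bound speed estimate."""
    def fits(m):
        mem = m.weights_gb + m.kv_gb_per_8k * (min_ctx / 8192)
        return mem <= vram_gb and bandwidth_gbs / m.active_gb >= min_tok_s
    return max((m for m in catalog if fits(m)),
               key=lambda m: m.quality, default=None)

# Hypothetical entries -- the numbers are illustrative, not measured:
catalog = [
    Model("dense-27b-q6",   quality=62, weights_gb=22.0, active_gb=22.0, kv_gb_per_8k=0.8),
    Model("moe-30b-a3b-q4", quality=58, weights_gb=18.0, active_gb=1.9,  kv_gb_per_8k=0.6),
]
print(best_model(catalog, vram_gb=48, bandwidth_gbs=273, min_tok_s=50, min_ctx=100_000))
```

The filtering is the easy half; a trustworthy quality score to rank by is the part nobody seems to have.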
It’s a hard problem. I’ve been working on it for the better part of a year.
Well, granted, my project is trying to do this in a way that works across multiple devices and supports multiple models to find the best “quality” and the best allocation, and that makes the search space exponential.
But “quality” is the hard part. In this case I’m just choosing the largest quants.
LLMs are just special-purpose calculators, as opposed to normal calculators, which just do numbers and MUST be accurate. There aren't very good ways of knowing what you want, because the people making the models can't read your mind and have different goals.