This is a good time to promote running your own models. I have been running models locally and I would wager a local model will meet 85-95% of your needs if you really learn to use it. These models have gotten great.

For anyone wanting to get into this, the smartest consumer-friendly models were just released: check out Qwen3.5 in the 27B and 35B variants. They are small, and I recommend running full Q8 quants. The easiest way to run them without dealing with complex GPU setups is to get a Mac. For the example I gave, a 64GB Mac will handle it well. If you are really cash-strapped you can manage with 32GB, but you will have to run lower-resolution quants. If you are not cash-strapped, get at least 128GB, and 256GB if possible; the models are so good you will regret not getting a better system.

You can join the r/LocalLLaMA community on reddit to learn more, but this is pretty easy: grab llama.cpp, then grab a GGUF quant from huggingface.co (the Unsloth quants are great): https://huggingface.co/unsloth/models
For non-Mac users:
A laptop with an iGPU and loads of system RAM has the advantage of being able to use system RAM in addition to VRAM to load models (assuming your GPU driver supports it, which most do AFAIK), so load up as much system RAM as you can. The downside is that system RAM is slower than dedicated GDDR memory. These iGPUs would be the Radeon 890M and Intel Arc (previous generations are still decently good, if that's more affordable for you).
A laptop with a discrete GPU will not be able to load models as large directly to GPU, but with layer offloading and a quantized MoE model, you can still get quite fast performance with modern low-to-medium-sized models.
Do not get less than 32GB RAM for any machine, and max out the iGPU machine's RAM. Also try to get a big NVMe drive, as you will likely be downloading a lot of big models and should be using a VM with Docker containers; all of that eats quite a bit of drive space.
Final thought: before you spend thousands on a machine, consider that there are at least a dozen companies that provide non-Anthropic/non-OpenAI models in the cloud, many of which are dirt cheap because of how fast and good open weights are now. Do the math before you purchase a machine; unless you are doing 24/7/365 inference, the cloud is vastly more cost effective.
> there are at least a dozen companies that provide non-Anthropic/non-OpenAI models in the cloud, many of which are dirt cheap because of how fast and good open weights are now.
Oh yeah, seems obvious now you said it, but this is a great point.
I'm constantly thinking "I need to get into local models but I dread spending all that time and money without having any idea if the end result would be useful".
But obviously the answer is to start playing with open models in the cloud!
Well, they are doing that because of the nature of matrix multiplication. Specifically, LLM costs scale with the square of the length of a single input, let's call it N, but only linearly in the number of batched inputs, M:
O(M * N^2 * d)
d is a constant related to the network you're running. Batching, btw, is the reason many tools like Ollama require you to set the context length before serving requests.
Having many more inputs is way cheaper than having longer inputs. In fact, this property is the reason we went with LLMs in the first place: batching ("serving many customers") is exactly what you do during training, so it lets training proceed quickly. GPUs came into the picture because taking 10k triangles and doing almost exactly the same calculation batched 1920*1080 times is exactly what happens behind the eyes of Lara Croft.
And this is simplified, because a single-vector input (i.e. M=1) is the worst case for the hardware, so vendors just don't do it (and certainly not in published benchmark results). Even older chips are often hardwired to work with M set to 8 (and these days 24 or 32) for every calculation. So until you hit roughly 20 customers/requests at the same time, the extra ones are almost entirely free in practice.
Hence: the optimization of subagents. Let's say you need an LLM to process 1 million words (let's say 1 word = 1 token for simplicity):
O(1 million words in one go) ~ 1e12 or 1 trillion operations
O(1000 times 1000 words) ~ 1e9 or 1 billion operations
O(10000 times 100 words) ~ 1e8 or 100 million operations
O(100000 times 10 words) ~ 1e7 or 10 million operations
O(one word at a time) ~ 1e6 or 1 million operations
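The arithmetic above can be checked mechanically (a sketch; the function name is mine, and this ignores the constant d, counting only the quadratic attention term):

```python
# Back-of-envelope attention cost: ignoring the constant d, processing
# num_chunks independent chunks of chunk_len tokens each costs roughly
# num_chunks * chunk_len^2 operations, since attention is quadratic per chunk.
def attention_ops(num_chunks: int, chunk_len: int) -> int:
    return num_chunks * chunk_len ** 2

# The same 1 million words, split up in different ways:
scenarios = {
    "one chunk of 1M words": attention_ops(1, 1_000_000),   # 1e12
    "1000 chunks of 1000":   attention_ops(1000, 1000),     # 1e9
    "10000 chunks of 100":   attention_ops(10_000, 100),    # 1e8
    "100000 chunks of 10":   attention_ops(100_000, 10),    # 1e7
    "one word at a time":    attention_ops(1_000_000, 1),   # 1e6
}
for name, ops in scenarios.items():
    print(f"{name}: {ops:.0e} operations")
```

Each 10x reduction in chunk length buys a 10x reduction in total work, which is exactly the subagent trade-off: many short contexts instead of one long one.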
Of course, to an extent this last way of doing things is the long-known case of a recurrent neural network. Very difficult to train, but if you get it working, it speeds away like Professor Snape confronted with a bar of soap (to steal a Harry Potter joke).
I agree but I still have that itch to have my own local model—so it's not always about cost. A hobby?
(Besides, a hopped-up Mac would never go to waste in my home if it turns out the local LLM thing was not worth the cost.)
> there are at least a dozen companies that provide non-Anthropic/non-OpenAI models in the cloud
Do you have some links?
Also I assume the privacy implications are vastly different compared to running locally?
Throw a rock and you'll hit one... Groq (not Grok; Elon stole the name), Mistral, SiliconFlow, Clarifai, Hyperbolic, Databricks, Together AI, Fireworks AI, CompactifAI, Nebius, Featherless AI, Hugging Face (they do inference too), Cohere, Baseten, DeepInfra, DeepSeek, Novita AI, OpenRouter, xAI, Perplexity Labs, AI21, OctoAI, Reka, Cerebras, Fal AI, Nscale, OVHcloud AI, Public AI, Replicate, SambaNova, Scaleway, WaveSpeedAI, Z.ai, GMI Cloud, Tensorwave, Lamini, Predibase, FriendliAI, Shadeform, Qualcomm Cloud, Alibaba Cloud AI, Poe, Bento LLM, BytePlus ModelArk, InferenceAI, IBM watsonx.ai, AWS Bedrock, Microsoft, Google
I use Ollama Cloud. $20/mo and I never come close to hitting quota (YMMV obviously).
They don't log anything, and they use US datacenters.
For privacy-preserving direct inference: Fireworks AI, Nebius.
Otherwise, OpenRouter for routing to lots of different providers.
On OpenRouter, for example, there are both open and closed models.
And if you don't want to buy a Mac? An 80 GB Nvidia GPU costs $10,000 (equivalent to 30 years of ChatGPT Plus subscription) and will probably be obsolete in 5-7 years anyway. What are my options if I want a decent coding agent at a reasonable price?
I downloaded Ollama ( https://github.com/ollama/ollama/releases ) and experimented with a few Qwen models ( https://huggingface.co/Qwen/collections ).
My performance when using an RTX 5070 with 12GiB VRAM, a Ryzen 7 9700X 8-core CPU, and 32GiB DDR5-6000 (2 sticks):
- "qwen2.5:7b": ~128 tokens/second (this model fits 100% in the VRAM).
- "qwen2.5:32b": ~4.6 tokens/second.
- "qwen3:30b-a3b": ~42 tokens/second (this is a MoE model with multiple specialized "brains") (this uses all 12GiB VRAM + 9GiB system RAM, but the GPU usage during tests is only ~25%).
- "qwen3.5:35b-a3b": ~17 tokens/second, but it's highly unstable and crashes -> currently not usable for me.
So currently my sweet spot is "qwen3:30b-a3b" - even if the model doesn't completely fit on the GPU, it's still fast enough. "qwen3.5" was disappointing so far, but maybe things will change in the future (maybe Ollama needs some special optimizations for the 3.5 series?).

I would therefore deduce that the most important thing is the amount of VRAM, and that performance would be similar even when using an older GPU (e.g. an RTX 3060, which also has 12GiB VRAM)?
Performance without a GPU, tested by using a Ryzen 9 5950X 16 cores CPU, 128GiB DDR4 3200 MT:
- "qwen2.5:7b": ~9 tokens/second
- "qwen3:32b": ~2 tokens/second
- "qwen3:30b-a3b": ~16 tokens/second

I'm able to run the Unsloth quants on an ancient dual-socket Xeon 1U server I keep around for homelab stuff. It has 8 DDR3 channels, which gives me about as much memory bandwidth as two channels of DDR5 :-/ But 16 DIMM slots and cheaper prices. So it has 256GB in it right now. I have to run the minimum-size Unsloth quant for the largest open-weight models; they definitely feel a bit dazed. This machine can support up to 1.5TB of DDR3, which would allow me to run many of the largest models unquantized, but at 1/4 of the already abysmal speeds I see of ~1 token/s, which is only really usable with multiple agents running a kanban-style async development process. Nothing interactive. That said, I picked up the hardware at the local surplus for $25 and it's vintage ~2010. Pretty impressive what this enterprise gear can do.
Power consumption? Don't ask. A subscription is cheaper.
> Power consumption
That's the thing: at the end of it all, power consumption will matter more for the end user who doesn't have money to burn, because I suspect that in the majority of cases power consumption will exceed the price of the hardware itself within a few months of intense use, a year at most.
Assuming models of a fixed size continue to improve in capability, continued advancement in semiconductors and optimization will reduce power consumption and/or improve performance over time. And used equipment will always approach the scrap price eventually. For me today, on scrap equipment, I get about 4 tokens per watt-hour; power is nominally ~$0.17 US per kWh but can run $0.40 after all the taxes and fees and surcharges. That works out to $0.10 per 1,000 tokens. Ouch.
If I were to try to purpose-build a rig for it, I would get an engineering-sample Epyc/motherboard/RAM combo from Aliexpress with 12 channels of DDR5 and as few cores as allowed me to still use all the memory bandwidth, and I'd run it at the lowest possible power and voltage settings with aggressive RAM timings. A system like that can draw 1/3 of what my scrap rig draws at full load, and has similar memory bandwidth to a high-end Mac or GPU, allowing it to crank out 5-10 tokens/s on the largest models, which works out to 1/3 of a penny to 2/3 of a penny per 1,000 tokens. But either way, Epyc or Mac is going to set you back $10k or more. Hopefully in a few years when they are scrap though...
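If you want to redo that electricity arithmetic for your own rig, it's a one-liner (function name is mine; the ~4 tokens per watt-hour and $0.40/kWh all-in figures are from the scrap-rig comment above):

```python
def cost_per_1k_tokens(tokens_per_wh: float, price_per_kwh: float) -> float:
    """Electricity cost in dollars to generate 1000 tokens."""
    tokens_per_kwh = tokens_per_wh * 1000   # tokens you get from one kWh
    return price_per_kwh / tokens_per_kwh * 1000

# Scrap-rig figures: ~4 tokens/Wh at $0.40/kWh after taxes and fees.
print(round(cost_per_1k_tokens(4, 0.40), 4))  # 0.1 -> ten cents per 1,000 tokens
```

Swap in your own measured tokens/s and wall wattage (tokens_per_wh = tokens_per_second * 3600 / watts) to compare against cloud per-token pricing.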
Rent a H100 on Modal which scales down to zero when not in use - you can set the time out period.
Cold boot times are around 5m but if your usage periods are predictable it can work out ok. Works out at $2 an hour.
Still far more expensive than a ChatGPT sub.
Do you have some reference on what setup you're talking about? I'd like to integrate it into my IDE (cursor/vscode) - are there docs on such a setup?
Start here
https://modal.com/docs/examples/vllm_inference
or give this a go
https://modal.com/docs/examples/opencode_server
You get $30 free credits each month on Modal which is enough to play around (i have no affiliation, just think they run a great service)
GPUs are not going obsolete anytime soon. The Nvidia P40/P100 launched in 2016, 10 years ago, and is still popular in the local space. My first set of GPUs was a bunch of P40s, bought 3 years ago for $150 a piece. At one point they went all the way up to $450, but the price is now down to the $200 range. I think I have gotten my value from those, and I suspect I'll still have them crunching out tokens for at least 3 more years. They still beat 90% of CPU/memory inference combos.
My point being that no one should be buying expensive GPUs when you can pick up a few used ones to get started. But for the sake of discussion, let's say you do get a Blackwell Pro 6000, which now goes for $10,000. I can assure you it will not be $150 ten years from now; with the falling price of the dollar, demand for AI inference, and hardware shortages, it might cost exactly the same 10 years from now...
A Strix Halo with 128GB unified memory is less than $2k and the more suitable alternative to a mac. I'm pretty happy with my device (Bosgame M5).
The Macs outperform it, and I figure a Mac is a better general-purpose computer than a Strix Halo. If budget is a problem, then a Strix Halo is a decent alternative.
Well a mac isn't really an alternative to a mac, or is it? ;)
Personally I'm not interested in having a mac as I work with linux. And yes, they outperform them, but only if you ignore the price. When comparing what you get for ~$2k, a Strix Halo is miles ahead.
Mac doesn't run Linux so in my books is a worse general purpose computer than a Strix Halo box.
> A Strix Halo with 128GB unified memory is less than $2k
Where did you get that price? Wherever I looked it's around 3k euros which is around $3.5k
I took my setup from here: https://github.com/kyuz0/amd-strix-halo-toolboxes
Still a lot to learn, but after a while you have something like Qwen3-Coder-Next-Q8_0 running, and - at least for me - it works quite well, both as a ChatGPT-like chat interface using llama.cpp and as a coding agent.
I'm not really using them for coding (only played a little bit with minimax2.1), which is probably the most common use case here.
I mainly use them for deep work with texts and deep research. My main criterion is privacy, both for legal reasons (I'm in the EU and can't and don't want to expose customers' data to non-GDPR-compliant services) and personally; I wouldn't use US services either, e.g. I would never explore health-related topics with ChatGPT or Gemini, for obvious reasons.
Technically, I've set it up in my office with llama.cpp and have exposed it (both the chat interface and the OpenAI-compatible API) via a simple WireGuard tunnel behind nginx with HTTP auth. Now I can use it everywhere. It's a small, quiet and pretty fast machine (compiling llama.cpp takes around 20 seconds?), and I quite like it.
What are my options if I want a decent coding agent at a reasonable price?
I'd even come at it from another angle: what are my options if I want a decent coding agent, on the level of what Claude does, at any given price? Let's say a few tens of thousands of dollars? I've had a limited look at what's available to run locally and nothing is on par.
Does not exist AFAIK. Even other labs struggle to reach Claude-level performance on real-world tasks. My experience is that no open model is close. You can get an RTX 6000 Pro Blackwell (the Max-Q edition is better; its power draw is half). I have heard good things about Qwen3 Coder Next, but I could not get tool calling to perform well; it's likely to be PEBKAC.
If you want to spend big bucks, get an H200 with 141 GB, but honestly the RTX 6000 Pro is good enough until you know what you want. The Workstation edition is good; it takes care of cooling etc.
Tbh, even better is to just get a model through the cloud. If you want, you can rent a GPU. Then see if it's what you want.
The gist of it is no matter the money you spend on hardware, you will not get the same quality you get from claude. Main question is then what can you run that's good enough? I haven't tested all there is available, but everything I did see does not come even close.
You can rent GPUs, this comes with a security, maintenance and performance overhead, but also has a few advantages.
But right now, a Mac is the easiest way because of their memory architecture.
Honestly you can run this on a 16GB VRAM GPU with llama.cpp. Just try it!
An even easier way to get into this is simply by downloading a program called LM Studio. You can mount a model and chat to it within 10-15 mins with no experience whatsoever, and no configuration at all.
That said, last time I tried local LLMs (around when gpt-oss came out) it still seemed super gimmicky (or at least niche, I could imagine privacy concerns would be a big deal for some). Very few use cases where you want an LLM but can't benefit immensely from using SOTA models like Claude Opus.
The financial barrier is kind of the opposite of "easy to run" to me.
As much as I love owning my stack, you'd have to use so much of this to break even vs an inference provider/aggregator with open frontier-ish models. (and personally, I want to use as little as possible)
As someone who desperately wants to use local models, I lament that there is no way to use them on consumer hardware for serious coding work. I have an RTX 4070 Ti Super and I cannot run any large model with enough context and tokens/second compared to a remote offering.
I have a 24GB Macbook Pro. I will note, do get the 'Pro' models, the Mac Mini and the Macbook Air do not have internal fans. The Macbook Pro has an internal fan, and the Mac Studio (bigger Mac Mini) has a fan. If you get a Mini, you might want to get one of those docks that cools the Mini. Your hardware will get very hot very quickly.
Also, Apple in their infinite wisdom, despite giving you a fan, turn it on very lazily (I swear it has to hit 100C before it comes on) and give you zero control over fan settings, so you may want to snag something like TG Pro for the Mac. I wound up buying a license for it; it lets you define at which temperature you want to run your fans and even gives you manual control.
On my 24GB RAM MacBook Pro I have about 16GB available for inference. I use Zed with LM Studio as the back-end. I primarily just use Claude Code, but as you note, I'm sure if I used a beefier Mac with more RAM I could probably handle way more.
There's a few models that are interesting on the Mac with LM Studio that let you call tooling, so it can read your local files and write and such:
- mistralai/mistralai-3-3b - this one's 4.49GB, so I can increase its context window. Not sure if it auto-compacts or not; I've only just started testing it.
- zai-org/glm-4.6v-flash - this one is 7.09GB; same thing, only just started testing it.
- mistralai/mistral-3-14b-reasoning - this one is 15.2GB, just shy of the max, so not a TON of wiggle room, but usable.
If you're Apple, or a company that builds things for Macs or other devices, please build something to help with airflow/cooling for the MBP/Mac Mini. It feels ridiculous that it becomes a 100C device, and I'm not so sure it's great for device health if you want to run inference for longer than the norm.
I will probably buy a new Mac whenever the inference speeds increase at a dramatic enough rate. I sure hope Apple is considering serious options for increasing inference speed.
So is it just like the Pro? Do I need to buy the fan software for my wife's mini too? Ridiculous...
>> I will note, do get the 'Pro' models, the Mac Mini and the Macbook Air do not have internal fans
I have a base model M4 Mac Mini and it absolutely does have a fan inside it.
I must have assumed it did not, since my wife's Mini never audibly spun up its fan and was hot beyond the norm to the touch, so I stopped using it for inference. If the standard-model Minis do have fans, I might reconsider instead of a Studio.
Yeah if you look at this timestamp you can see the fan, to be fair the M4 Pro has a slightly beefier heatsink but both have a fan.
How are the Ryzen 395 with 128gb for running models these days?
No complaints here; I use a Framework Desktop with this chip, with 32G kept as system RAM and the rest allocated as VRAM. It can run large models like 'gpt-oss:120b' fine. I splurged and got a second SSD for mirroring, hoping to speed up reads/model loads. Haven't tested this for efficacy, but it also gives redundancy. Shrugs!
Haven't paid a subscription in years or even signed up for $EMPLOYER offerings; handles the rare outsourcing well enough.
Or you can get a strix halo from AMD. They run about $2k from various Chinese brands, or a bit more from Framework. 128GBs of unified RAM are plenty for most models, although memory bandwidth is slower than in a mac.
I really hope at some point in the near future AI models shrink enough or laptops get strong enough to run AI models locally. I haven't tried in the past year, but when I did it was very slow token output + laptop was on fire to make that happen.
I've wanted to try some of the more recent 8B models for local tab completion or agentic, any experience with those kinds of smaller models?
I've been running local language models on an existing laptop with 8GB GPU, currently using ministral-3:8b. It's faster than other models of similar size I used previously, fast enough that I never wait for it, rather have to scroll back to read the full output.
So far I'm using it conversationally, and scripting with tools. I wrote a simple chat interface / REPL in the terminal. But it's not integrated with code editor, nor agentic/claw-like loops. Last time I tried an open-source Codex-like thing, a popular one but I forget its name, it was slow and not that useful for my coding style.
It took some practice but I've been able to get good use out of it, for learning languages (human and programming), translation, producing code examples and snippets, and sometimes bouncing ideas like a rubber-duck method.
qwen3-8b is good, and if you are doing tab completion it's more than adequate. You can get basic agentic use out of it, but if you really want to use a serious agent and do some serious work, then at the very least qwen3.5-27B if you have a 5090 with 32GB VRAM, or qwen3.5-35b-a3b if you have less than 24GB. If you want to use a laptop, get one with a dedicated GPU or a capable iGPU.
Untested, but relevant:

> NTransformer: High-efficiency C++/CUDA LLM inference engine. Runs Llama 70B on a single RTX 3090 (24GB VRAM) by streaming model layers through GPU memory via PCIe, with optional NVMe direct I/O that bypasses the CPU entirely.
I had some luck with Ollama + Mistral Nemo models on consumer hardware; it seemed to punch above its "weight class". But it's still far enough behind ChatGPT et al. that I couldn't stop using the latter for real work.
I have a lenovo workstation with 256GB ram but a weak sauce 12GB VRAM GPU. Is there any DMA trick to improve offload performance?
Things such as AirLLM, or good old llama.cpp.
Use llama.cpp; you will be surprised how fast a model like qwen3.5-35b-a3b will run. The "a3b" means only 3B active parameters, so while inferring, those 3B will be in your GPU and you will get amazing performance. For your system, you should use the -cmoe option.
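A rough sketch of why the a3b variant flies on a small GPU (the helper function and the byte-per-weight-at-Q8 approximation are mine; real runtimes also need room for the KV cache and other overhead):

```python
def approx_weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Rough model-weight footprint in GB: params * bits / 8.
    Ignores KV cache, activations, and runtime overhead."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A hypothetical 35B-total / 3B-active MoE at Q8 (8 bits per weight):
print(approx_weight_gb(35, 8))  # 35.0 GB total -> lives in system RAM
print(approx_weight_gb(3, 8))   # 3.0 GB active per token -> fits a 12GB GPU easily
```

That gap between total and active parameters is what the MoE-offload options exploit: keep the big expert tensors in system RAM and only the hot path on the GPU.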
I've noticed that running models locally is not necessarily easy. I'm currently trying to use Stable Diffusion with Flux2 klein 4b fp4 (because I have a normal GPU and not a specialised setup), and I can't get it to produce anything other than uneven blue.
I haven't tried pure text models, but 27B sounds painful for my system.
Isn't Q4-Q6 the usual recommendation for quants? Can you explain the Q8 recommendation? I was under the impression that if you can run a model at Q8, you should probably run a bigger model at Q4 instead.
There are no hard rules regarding quants, except that less quantization is better.
However, models respond very differently, and there are tricks you can do like limiting quantization of certain layers. Some models can generally behave fine down into sub-Q4 territory, while others don't do well below Q8 at all. And then you have the way it was quantized on top of that.
So either find some actual benchmarks, which can be rare, or you just have to try.
As an example, Unsloth recently released some benchmarks[1] which showed Qwen3.5 35B tolerating quantization very well, except for a few layers which were very sensitive.
edit: Unsloth has a page detailing their updated quantization method here[2], which was just submitted[3].
[1]: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
[2]: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
If you can run Q8, go for it; always go for the best. It matters a lot with vision models. And never quantize your KV cache; keep it at f16.
You can always try evals and see if you have a Q6 or Q4 that performs better than your Q8. For smaller models I go Q8; for bigger ones, when I run out of memory, I go Q6/Q5/Q4 and sometimes Q3. I run deepseek/kimi at Q4, for example.
I suggest beginners start with Q8 so they get the best quality and aren't disappointed. It's simple to use Q8 if you have the memory; choice fatigue and confusion come in once you start trying to pick other quants...
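As a rough rule of thumb you can estimate which quant fits your memory budget before downloading anything (this helper and its ~10% overhead fudge factor are my own sketch, not a real tool; actual GGUF sizes vary by quant scheme):

```python
def pick_quant(params_billions: float, mem_gb: float, quants=(8, 6, 4, 3)):
    """Return the highest-precision quant whose weights roughly fit mem_gb.
    Size estimate: params * bits / 8, plus ~10% for tensors that stay higher precision."""
    for bits in quants:
        size_gb = params_billions * bits / 8 * 1.1
        if size_gb <= mem_gb:
            return f"Q{bits}", round(size_gb, 1)
    return None  # nothing fits; look at a smaller model

# A 27B model against common memory budgets:
print(pick_quant(27, 32))  # Q8 fits in 32GB
print(pick_quant(27, 24))  # falls back to Q6 in 24GB
print(pick_quant(27, 10))  # None: even Q3 won't fit
```

Remember this only budgets the weights; leave headroom for the KV cache, which grows with context length.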
The big AI labs are almost certainly selling inference below cost and burning mountains of money. With the insane increase in hardware prices, running models locally just doesn’t make any financial sense.
Nobody is saying it makes "financial sense", it's about control.
I have always taken plenty of care to try and avoid becoming dependent on big tech for my lifestyle. Succeeded in some areas, failed in others.
But now AI is a part of so many things I do and I'm concerned about it. I'm dependent on Android but I know with a bit of focus I have a clear route to escape it. Ditto with GMail. But I don't actually know what I'd do tomorrow if Gemini stopped serving my needs.
I think for those of us that _can_ afford the hardware it is probably a good investment to start learning and exploring.
One particular thing I'm concerned about is that right now I use AI exclusively through the clients Google picked for me, coz it makes financial sense. (You don't seem to get free bubble money if you buy tokens via API billing, only consumer accounts). This makes me a bit of a sheep and it feels bad. There's so much innovation happening and basically I only benefit from it in the ways Google chooses.
(Admittedly I don't need local models to fix that particular issue, maybe I should just start paying the actual cost for tokens).
Apparently inference itself is profitable, at least according to an interview I watched with Dario. They even cover the cost of training itself, if you look at it on a model-by-model basis.
The cash burn comes from models ballooning in size - they spend (as an example, not actual numbers) 100M on training + inference for the lifetime of Sonnet 3.5, make 200M from subscriptions/api keys while it's SOTA, but then have to somehow come up with 1B to train Opus 4.0.
To run some other back of the envelope calcs: GLM 4.7 Air (previous "good" local LLM) can generate ~70 tok/s on a Mac Mini. This equates to 2,200 million tokens per year.
Openrouter charge $0.40 per million tokens, so theoretically if you were using that Mac mini at 100% utilisation you'd be generating $880 per annum "worth" of API usage.
Assuming a power draw of something like 50W, you're only looking at 440kWh per annum. At 20c per kWh that's $90 on power, plus $499 to get the hardware itself. Depreciate that $499 hardware cost over 3 years and you're looking at ~$260 to generate ~$880 in inference income.
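Those back-of-the-envelope numbers check out (a sketch; the function name is mine, the figures are from the comment above, and it assumes 100% utilisation):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def yearly_value_vs_cost(tok_per_s, api_price_per_m, watts, price_per_kwh,
                         hw_cost, depreciation_years):
    """Compare API 'value' of a year of tokens vs. power + depreciated hardware cost."""
    tokens = tok_per_s * SECONDS_PER_YEAR                    # tokens generated per year
    api_value = tokens / 1e6 * api_price_per_m               # what they'd cost via the API
    power_cost = watts / 1000 * 365 * 24 * price_per_kwh     # electricity per year
    return round(api_value), round(power_cost + hw_cost / depreciation_years)

# GLM 4.7 Air on a Mac mini: 70 tok/s, $0.40/M tokens, 50W, $0.20/kWh, $499 over 3 years.
print(yearly_value_vs_cost(70, 0.40, 50, 0.20, 499, 3))  # (883, 254)
```

So roughly $880 of API-equivalent output for ~$260 of cost per year, with the obvious caveat that nobody actually runs at 100% utilisation.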
We are not in this thread because of finances but because of safety from oppressive governments and bad big corps. It's for you to decide the price of your own safety.
RAM and storage price increases due to the AI bubble have certainly made the cost of entry more expensive, but once you have the hardware, running models locally does make financial sense, especially if you have access to home solar power that is sufficient to run the hardware. You can't get much lower running cost than free.
I just can't help but imagine ChatGPT's sycophancy mixed with military operations. "Sharp insight bombing that wedding! Next, would you like tips on mosques to bomb, or I can suggest some new napalm recipes that are extra spicy. Your call!"
Department of Defense: You just bombed the wrong Georgia! The people of Atlanta are furious!
ChatGPT: You're absolutely right, and you're right to call that out. Upon examination it does appear that there might have been a mistake with the coordinates of the bomb. Let's try again, this time we will double check before we launch any missiles! :missile emoji:
You forgot "I will take full responsibility"
:checkmark: :checkmark:
That’s not actually the name of that department. That requires an act of congress.
You can call it what you’d like but I’ll stick with the official name instead of the words of the lolcow we decided to make president.
Not enough emdashes!
If Anthropic had given in, I'd imagine the dialogs would look something like the Claude CLI:
To complete the mission the war terminal needs to hit a target at XY:
1. yes
2. yes (and don't ask again for strike targets in this session)
3. no
Human in the loop is the term here I think.
(I am really glad they did not give in, but I do assume this is what it will come to anyway)
The point is it will be autonomous, the prompt could just be 'keep me safe' which will be interpreted who knows how and presumably no further prompting.
Autonomous just means this narrative is what you'd see if you looked at the logs of the drone talking to itself in its head...?
This is giving me strong Dark Star vibes with the intelligent bomb forcing a philosophical discussion about existence and perception at the end.
(Spoilers for the ending of the movie: https://youtu.be/h73PsFKtIck?si=tTm9TidmEMBHsXq1 )
Assuming it's not smart enough to write logs that make it less likely to be prosecuted/ disabled by coming up with fake reasons.
It can just say you were a terrorist because you were an adult male traveling with something in your hands. Humans already do this to justify strikes, likely the AI would do the same.
> Assuming it's not smart enough to write logs that make it less likely to be prosecuted
Alternatively: Assuming it's smart enough not to consider logging to /dev/null a reasonable way to speed up execution times.
Don't forget to add no melting ghost babies to that prompt!
I think I can guess what training data it used for the wedding droning idea!
Reminds me of the Lazy Gun in Against a Dark Background!
Story time!
I actually cancelled my ChatGPT subscription in late 2024 and documented the process, kind of as a social media thing because it had gotten so bad and I realized nobody in my family was using it anymore. I asked my wife if she was getting any use out of it and she told me she had been using Gemini and Grok for months because "GPT is very lazy now".
After a while another charge came in for the subscription, but I had the receipts: we had cancelled before the next billing cycle. I decided to try to reach out to OpenAI to resolve this, but they only let you chat with GPT itself for this. It failed at the task, told me they weren't in the wrong, and none of the information it gave matched what actually happened.
I took this and used it to submit a chargeback request with Privacy.com, which I use for all of my online purchases. Normally I don't have to worry about this because I set a limit or cancel the cards I issue manually, but I had an OpenAI API account using the same card and I had been a bit lazy in using the same card for technically two different services.
Well, Privacy.com won that dispute and I got my money back. It's worth mentioning this is actually different from what most banks will do nowadays. For the most part, when you try to get a bank to do a chargeback, they just roll it into their insurance and refund you, the customer, as a cost of doing business, while the actual scammer or shady merchant gets to keep their stolen money; whereas I can be certain OpenAI didn't keep my money.
Your last statement is false. A shady merchant never gets to keep the stolen money. The card issuer/bank refunds you immediately because of consumer protection laws, but that charge is immediately passed on to the processor. The processor then gets the merchant involved in a dispute process. If the merchant loses, the processor charges the merchant; one way they do it is to immediately deduct it from the merchant's current processed transactions. If the merchant is no longer processing, they will usually try to claw it back from the merchant's bank account if there are no held reserves, and if they can't get it, they send the merchant to collections. In the end, either the merchant or the processor must eat the cost. So in your case, the bank didn't eat the cost; OpenAI certainly ate the cost and the chargeback fees.
You are incorrect.
Chase uses a "provisional credit" system, but for small amounts, this credit often becomes permanent almost instantly.
Wells Fargo utilizes an automated system called the Wells Fargo Dispute Manager which is also similar.
Technically, it is self-insurance. Banks set aside a portion of their interchange revenue (the fees they charge merchants for every swipe) into a "Provision for Credit Losses". They use this pool of money to "buy" customer satisfaction for small errors rather than paying an employee $30/hour to investigate a $12 dispute.
And yet I've dealt with $15 (and I believe smaller) chargebacks on more than one occasion. Chargebacks for which Stripe charges me $15 even if I win the dispute.
The banks don't seem to care one bit about any evidence you do provide anyway; I just imagine their dispute system is "sleep(10 days); return DENIED;"
> Your last statement is false. A shady merchant never gets to keep the stolen money.
Or any merchant for that matter. Chargebacks (from bad actors) are one of the most annoying things when you sell online when you’re a honest legit business. Stripe even charges you a penalty fee on top of that.
This makes me feel good. My gym was one of those places that lured you in with a low monthly price then heavily upsell you on personal trainers. The trainers were fine, but they were fairly inexperienced (usually college kids who took a course and then wanted a summer job or something). And once they found something better they'd leave. Nothing against the trainers, but I prefer to have consistency with my coaches.
Anyway, when I canceled my membership, I realized a few months later they were still charging me for training sessions. I chatted with the manager and apparently I had to cancel those separately. They refused to refund me for the sessions I never took. But they had a kind offer of being able to get training without paying for membership.
Well, I wasn't buying it, so I went on Chase and initiated chargebacks for every session charged after I canceled.
I dunno what happened on their end, but I got my money back. The business is still around, so I guess I didn't hurt them that much, but it's good to know that I probably was at least a hindrance.
Your credibility is shot when you claim that banks will just give you money. They absolutely do not. In fact, Discover has admitted to me in writing that they always rule in favor of the merchant if that merchant responds to the dispute -- regardless of what their response says.
I've dealt with multiple chargebacks over the years and have only ever lost once -- when the Manager at Lowes' showed a check they wrote me [after I opened the dispute].
They absolutely do not just "write it off". Please be human and don't rattle off high-confidence, baseless claims, especially as a giant billboard for Privacy.com.
> Discover has admitted to me in writing, that they always rule in favor of the Merchant if that Merchant responds to the dispute -- regardless of what their response says
What, always? Like, literally 100% of the time if the merchant responds at all, they automatically win?
That's very hard to believe. I don't know Discover but I do know Visa and that's not how their system works at all.
I use Amex as much as possible because it’s basically never a fight. If I dispute, I get my money back. Granted, I don’t abuse the power so maybe I’ve earned some trust over the decades.
Wells Fargo, Chase, Capital One, and many others practice this provisional-credit system, which functions very similarly to insurance.
Go read your bank's terms and you'll find the provision. Do you want me to read your bank's terms for you and point them out?
> Well, Privacy.com won that dispute and I got that money back.
Well, it seems like ChatGPT's automated litigation resolution got lazy against Privacy.com. I wonder how a company with an AI can lose a dispute instead of smokescreening the opponent with legitimate arguments and legalese.
It helps when you have a video posted on social media the day you cancelled, and a video of me talking to a clueless AI customer-retention system that didn't seem to understand how time works.
Also, chargeback disputes are limited to 3 rounds of back and forth by both Visa and MasterCard. They don't get to endlessly come back etc.