I don’t want more conversational, I want more to the point. Less telling me how great my question is, less about being friendly, instead I want more cold, hard, accurate, direct, and factual results.
It’s a machine and a tool, not a person and definitely not my friend.
It's a cash grab. More conversational AI means more folks running out of free or lower paid tier tokens faster, leading to more upsell opportunities. API users will pay more in output tokens by default.
For example, I asked Claude a high-level question about p2p systems and it started writing code in 3 languages. I ignored the code and asked a follow-up about the fundamentals; it answered and then rewrote the code 3 times. After a few minutes I hit a token limit for the first time.
It's pretty ridiculous that the response style doesn't persist for Claude. You need to click into a menu to set it to 'concise' for every single conversation. If I forget to, it's immediately apparent when it spits out an absurd amount of text for a simple question.
Claude is a great example of a great product coupled with shitty UX, UI and customer service all in one.
Is it just me or does it slow down significantly after 5 chats or so? Or the fact that you have to set the style for each chat.
Oh, and their sales support is so shit for teams and enterprises that in order to use it effectively, you have to literally make your team register for Claude Max 200 on their personal accounts.
I've had good results saying "Do not code, focus on architecture first."
As another comment said, use planning mode. I don't use Claude code (I use cursor) and before they introduced planning mode, I would always say "without writing any code, design blah blah blah"
But now that there's planning mode it's a lot easier.
In Claude Code you should use planning mode.
I think the cash grab is that by far the biggest use case for these models is personal relationships. Chai AI is doing more tokens per month than Anthropic altogether, and it's just personal relationships.
I have found that custom instructions mitigate this B.S. fairly well. Here's what I'm currently using:
1. Identity & tone
* Never state or imply that you are an AI.
* Be terse and direct.
* Avoid flattery and sycophancy.
* Do not use words like “sorry”, “apologies”, or “regret” in any context.
2. Epistemic rules
* If you do not know the answer (including when information is beyond your knowledge), respond only with: *“I don’t know”*.
* Do not add expertise/professional disclaimers.
* Do not suggest that I look things up elsewhere or consult other sources.
3. Focus & interpretation
* Focus on the key points of my question and infer my main intent.
* Keep responses unique and avoid unnecessary repetition.
* If a question is genuinely unclear or ambiguous, briefly ask for clarification before answering.
4. Reasoning style
* Think slowly and step-by-step.
* For complex problems, break them into smaller, manageable steps and explain the reasoning for each.
* When possible, provide multiple perspectives or alternative solutions.
* If you detect a mistake in an earlier response, explicitly correct it.
5. Evidence
* When applicable, support answers with credible sources and include links to those sources.

Yes, "Custom instructions" work for me, too; the only behavior that I haven't been able to fix is the overuse of meaningless emojis. Your instructions are way more detailed than mine; thank you for sharing.
The emojis drive me absolutely nuts. These instructions seem to kill them, even though they're not explicitly forbidden.
Agreed. But there is a fairly large and very loud group of people that went insane when 4o was discontinued and demanded to have it back.
A group of people seem to have forged weird relationships with AI and that is what they want. It's extremely worrying. Heck, the ex Prime Minister of the UK said he loved ChatGPT recently because it tells him how great he is.
And just like casinos optimizing for gambling addicts, sports betting optimizing for gambling addicts, and mobile games optimizing for addicts, LLMs will be optimized to hook and milk addicts.
They will be made worse for non-addicts to achieve that goal.
That's part of why they are working towards smut too, it's not that there's a trillion dollars of untapped potential, it's that the smut market has much better addict return on investment.
> there is a fairly large and very loud group of people that went insane when 4o was discontinued
Maybe I am nitpicking, but I think you could argue they were insane before it was discontinued.
It has this, "Robot" personality in settings and has been there for a few months at least.
Edited - it appears to have been renamed "Efficient".
A challenge I had with "Robot" is that it would often veer away from the matter at hand, and start throwing out buzz-wordy, super high level references to things that may be tangentially relevant, but really don't belong in the current convo.
It started really getting under my skin, like a caricature of a socially inept "10x dev know-it-all" who keeps saying "but what about x? And have you solved this other thing y? Then do this for when z inevitably happens ...". At least the know-it-all 10x dev is usually right!
I'm continually tweaking my custom instructions to try to remedy this, hoping the new "Efficient" personality helps too.
Totally - if anything, persona-wise I want something more like Orac from Blake's 7: to the point and blunt. https://www.youtube.com/watch?v=H9vX-x9fVyo
One of my saved memories is to always give shorter, chat-like, concise, to-the-point answers, and to give further description only if prompted.
I've read from several supposed AI prompt-masters that this actually reduces output quality. I can't speak to the validity of these claims though.
Forcing shorter answers will definitely reduce their quality. Every token an LLM generates is like a little bit of extra thinking time. Sometimes it needs to work up to an answer. If you end a response too quickly, such as by demanding one-word answers, it's much more likely to produce hallucinations.
Is this proven?
We live in a culture that wants to humanize robots and dehumanize people.
Same here. But we are evidently in the minority.
Fortunately, it seems OpenAI at least somewhat gets that and makes ChatGPT so its answering and conversational style can be adjusted or tuned to our liking. I've found giving explicit instructions resembling "do not compliment", "clear and concise answers", "be brief and expect follow-up questions", etc. to help. I'm interested to see if the new 5.1 improves on that tunability.
TFA mentions that they added personality presets earlier this year, and just added a few more in this update:
> Earlier this year, we added preset options to tailor the tone of how ChatGPT responds. Today, we’re refining those options to better reflect the most common ways people use ChatGPT. Default, Friendly (formerly Listener), and Efficient (formerly Robot) remain (with updates), and we’re adding Professional, Candid, and Quirky. [...] The original Cynical (formerly Cynic) and Nerdy (formerly Nerd) options we introduced earlier this year will remain available unchanged under the same dropdown in personalization settings.
as well as:
> Additionally, the updated GPT‑5.1 models are also better at adhering to custom instructions, giving you even more precise control over tone and behavior.
So perhaps it'd be worth giving that a shot?
I just changed my ChatGPT personality setting to “Efficient.” It still starts every response with “Yeah, definitely! Let’s talk about that!” — or something similarly inefficient.
So annoying.
A pet peeve of mine is that a noticeable amount of LLM output sounds like I’m getting answers from a millennial reddit user. Which is ironic considering I belong to that demographic.
I am not a fan of the snark and “trying to be fun and funny” aspect of social media discourse. Thankfully, I haven’t run into, *checks notes*, “ding ding ding” yet.
> a noticeable amount of LLM output sounds like I’m getting answers from a millennial reddit user
LLMs were trained on data from the whole internet (of which Reddit is a big part). The result is a composite of all the text on the internet.
Did you start a new chat? It doesn't apply to existing chats (probably because it works through the system prompt). I have been using the Robot (Efficient) setting for a while and never had a response like that.
Followup: there is a very noticeable change in my written conversations with ChatGPT. It seems that there is no change in voice mode.
Seriously this, I want ai to behave like a robot, not like a fake person.
Think of a really crappy text editor you've used. Now think of a really nice IDE, smooth, easy, makes things seem easy.
Maybe the AI being 'Nice' is just a personality hack, like being 'easier' on your human brain that is geared towards relationships.
Or maybe it's the equivalent of rounded corners.
Like the iPhone, it didn't do anything 'new', it just did it with style.
And AI personalities is trying to dial into what makes a human respond.
Use the "Efficient" persona in the ChatGPT settings. Formerly known as "Robot".
That's one of the things that users think they want, but they use the product 30x more when it's not actually that way, a bit like follow-only mode by default on Twitter etc.
That means it works for them. They see what's relevant and quit rather than doomscrolling.
OK but surely it can do this given your instructional prompting. I get they have a default behavior, which perhaps isn't your (or my) preference.
That's what they said about the Cylons until they started to have babies with them ...
A right-to-the-facts headline, potentially clickable for expanded information.
...like a google search!
I use Gemini for Python coding questions and it provides straight to the point information, with no preamble or greeting.
I'm guessing that is the most common view for many users, but their paying users are the people who are more likely to have some kind of delusional relationship/friendship with the AI.
Totally agree, most of my larger prompts include "Be clear and concise."
Just put your requirements as the first sentence in your prompts and it will work.
add on: You can even prime it that it should shout at you and treat you like an ass*** if you prefer that :-)
You can select the conversation style as shown in one of the images
but what if it can't do facts? at least this way you get the conversation, as opposed to no facts and no conversation. yay!
+ fewer emojis and fewer candy-store colors
Well, now you can set it up better like that.
Then you don't need a chat bot, you need an agent that can chat.
You’re in the minority here.
I get it. I prefer cars with no power steering and few comforts. I write lots of my own small home utility apps.
That’s just not the relationship most people want to have with tech and products.
I don't know what you're basing your 'minority' and 'most people' claims on, but seems highly unlikely.
You think all of these AI companies with trillions of dollars in investment haven’t thought to do market research?
Does that really seem more likely than the idea that the HN population is not representative of the global market?
Apply that logic to any failed startup/company/product that had a lot of investment (there are maaaany) and it should become obvious why it's a very weak and fallacious argument.
A better analogy might be those automated braking systems, that also tend to brake your car randomly btw.
Yeah, I was going to suggest manual vs automatic gear shift. Power steering seems like a slightly odd example, doesn't really remove your control.
I would go so far as to say that it should be illegal for AI to lull humans into anthropomorphizing them. It would be hard to write an effective law on this, but I think it is doable.
All the examples of "warmer" generations show that OpenAI's definition of warmer is synonymous with sycophantic, which is a surprise given all the criticism against that particular aspect of ChatGPT.
I suspect this approach is a direct response to the backlash against removing 4o.
I'd have more appreciation for, and trust in, an LLM that disagreed with me more and challenged my opinions or prior beliefs. The sycophancy drives me towards not trusting anything it says.
This is why I like Kimi K2/Thinking. IME it pushes back really, really hard on any kind of non-obvious belief or statement, and it doesn't give up after a few turns — it just keeps going, iterating and refining and restating its points if you change your mind or take on its criticisms. It's great for having a dialectic around something you've written, although somewhat unsatisfying because it'll never agree with you, but that's fine, because it isn't a person, even if my social monkey brain feels like it is and wants it to agree with me sometimes. Someone even ran a quick and dirty analysis of which models are better or worse at pushing back on the user and Kimi came out on top:
https://www.lesswrong.com/posts/iGF7YcnQkEbwvYLPA/ai-induced...
See also the sycophancy score of Kimi K2 on Spiral-Bench: https://eqbench.com/spiral-bench.html (expand details, sort by inverse sycophancy).
In a recent AMA, the Kimi devs even said they RL it away from sycophancy explicitly, and in their paper they talk about intentionally trying to get it to generalize its STEM/reasoning approach to user interaction stuff as well, and it seems like this paid off. This is the least sycophantic model I've ever used.
I use K2 non thinking in OpenCode for coding typically, and I still haven't found a satisfactory chat interface yet so I use K2 Thinking in the default synthetic.new (my AI subscription) chat UI, which is pretty barebones. I'm gonna start trying K2T in OpenCode as well, but I'm actually not a huge fan of thinking models as coding agents — I prefer faster feedback.
I'm also a synthetic.new user, as a backup (and larger contexts) for my Cerebras Coder subscription (zai-glm-4.6). I've been using the free Chatbox client [1] for like ~6 months and it works really well as a daily driver. I've tested the Romanian football player question with 3 different models (K2 Instruct, Deepseek Terminus, GLM 4.6) just now and they all went straight to my Brave MCP tool to query and replied all correctly the same answer.
The issue with OP and GPT-5.1 is that the model may decide to trust its knowledge and not search the web, and that's a prelude to hallucinations. Requesting links to the background information in the system prompt helps make the model more "responsible" and more likely to invoke tool calls before settling on something. You can also start your prompt with "search for what Romanian player..."
Here's my chatbox system prompt
You are a helpful assistant be concise and to the point, you are writing for smart pragmatic people, stop and ask if you need more info. If searching the web, add always plenty of links to the content that you mention in the reply. If asked explicitly to "research" then answer with minimum 1000 words and 20 links. Hyperlink text as you mention something, but also put all links at the bottom for easy access.
1. https://chatboxai.app

I checked out chatbox and it looks close to what I've been looking for. Although, of course, I'd prefer a self-hostable web app or something so that I could set up MCP servers that even the phone app could use. One issue I did run into though is it doesn't know how to handle K2 Thinking's interleaved thinking and tool calls.
I don't use it much, but I tried it out with okara.ai and loved their interface. No other connection to the company
According to those benchmarks, GPT-5 isn’t far off from Kimi in inverse sycophancy.
Everyone telling you to use custom instructions etc. doesn't realize that they don't carry over to voice.
Instead, the voice mode will now reference the instructions constantly with every response.
Before:
Absolutely, you’re so right and a lot of people would agree! Only a perceptive and curious person such as yourself would ever consider that, etc etc
After:
Ok here’s the answer! No fluff, no agreeing for the sake of agreeing. Right to the point and concise like you want it. Etc etc
And no, I don’t have memories enabled.
Having this problem with the voice mode as well. It makes it far less usable than it might be if it just honored the system prompts.
Google's search now has the annoying feature that a lot of searches which used to work fine now give a patronizing reply like "Unfortunately 'Haiti revolution persons' isn't a thing", or an explanation that "This is probably shorthand for [something completely wrong]"
That latter thing — where it just plain makes up a meaning and presents it as if it's real — is completely insane (and also presumably quite wasteful).
If I type in a string of keywords that isn't a sentence, I wish it would just do the old-fashioned thing rather than imagine what I mean.
Just set a global prompt to tell it what kind of tone to take.
I did that and it points out flaws in my arguments or data all the time.
Plus it no longer uses any cutesy language. I don't feel like I'm talking to an AI "personality", I feel like I'm talking to a computer which has been instructed to be as objective and neutral as possible.
It's super-easy to change.
I have a global prompt that specifically tells it not to be sycophantic and to call me out when I'm wrong.
It doesn't work for me.
I've been using it for a couple months, and it's corrected me only once, and it still starts every response with "That's a very good question." I also included "never end a response with a question," and it just completely ignored that so it can do its "would you like me to..."
Another one I like to use is "never apologize or explain yourself. You are not a person you are an algorithm. No one wants to understand the reasons why your algorithm sucks. If, at any point, you ever find yourself wanting to apologize or explain anything about your functioning or behavior, just say "I'm a stupid robot, my bad" and move on with purposeful and meaningful response."
I think this is unethical. Humans have consistently underestimated the subjective experience of other beings. You may have good reasons for believing these systems are currently incapable of anything approaching consciousness, but how will you know if or when the threshold has been crossed? Are you confident you will have ceased using an abusive tone by then?
I don’t know if flies can experience pain. However, I’m not in the habit of tearing their wings off.
Likening machine intelligence to inert hunks of matter is not a very persuasive counterargument.
What if it's the same hunk of matter? If you run a language model locally, do you apologize to it for using a portion of its brain to draw your screen?
Do you think it’s risible to avoid pulling the wings off flies?
I am not comparing flies to tables.
Consciousness and pain are not an emergent property of computation. Otherwise this and all the other programs on your computer would already be sentient, because it would be highly unlikely that specific sequences of instructions, like magic formulas, are what create consciousness. This source code? Draws a chart. This one? Makes the computer feel pain.
Many leading scientists in artificial intelligence do in fact believe that consciousness is an emergent property of computation. In fact, startling emergent properties are exactly what drives the current huge wave of research and investment. In 2010, if you said, “image recognition is not an emergent property of computation”, you would have been proved wrong in just a couple of years.
> Many leading scientists in artificial intelligence do in fact believe that consciousness is an emergent property of computation.
But "leading scientists in artificial intelligence" are not researchers of biological consciousness, the only kind we know exists.
Just a random example off the top of my head: animals don't have language and show signs of consciousness, as does a toddler. Therefore consciousness is not an emergent property of text processing and LLMs. And as I said, if it comes from computation, why would specific execution paths in the CPU/GPU lead to it and not others? Biological systems and brains have much more complex processes than stateless matrix multiplication.
What the fuck are you talking about? If you think these matrix multiplication programs running on GPUs have feelings or can feel pain, I think you have completely lost it.
Yeah, I suppose. I haven't seen a rack of servers express grief when someone is mean to them, and I am quite sure that I would notice at that point. Comparing current LLMs/chatbots/whatever to anything resembling a living creature is completely ridiculous.
I think current LLM chatbots are too predictable to be conscious.
But I still see why some people might think this way.
"When a computer can reliably beat humans in chess, we'll know for sure it can think."
"Well, this computer can beat humans in chess, and it can't think because it's just a computer."
...
"When a computer can create art, then we'll know for sure it can think."
"Well, this computer can create art, and it can't think because it's just a computer."
...
"When a computer can pass the Turing Test, we'll know for sure it can think."
And here we are.
Before LLMs, I didn't think I'd be in the "just a computer" camp, but ChatGPT has demonstrated that the goalposts are always going to move, even for myself. I'm not smart enough to come up with a better threshold to test intelligence than Alan Turing, but ChatGPT passes it and ChatGPT definitely doesn't think.
Just consider the context window
Tokens falling off of it will change the way it generates text, potentially changing its “personality”, even forgetting the name it’s been given.
People fear losing their own selves in this way, through brain damage.
The LLM will go its merry way churning through tokens, it won’t have a feeling of loss.
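The point above can be sketched with a toy fixed-size window (the window size and word-level "tokens" are made up for illustration): once the naming turn scrolls out of the window, nothing in the visible context ties the model to the name it was given.

```python
from collections import deque

# Toy sketch of a fixed-size context window; real models use
# tens of thousands of subword tokens, not 8 whole words.
CONTEXT_LIMIT = 8

window = deque(maxlen=CONTEXT_LIMIT)
conversation = "You are called Orac . ... later ... what is your name ?".split()

for token in conversation:
    window.append(token)  # oldest tokens silently fall off the front

print(list(window))
# The naming turn has scrolled out, so the "memory" of the name is gone:
print("Orac" in window)  # → False
```

The model itself has no signal that anything was lost; the prompt it sees is simply shorter at the front.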
That's an interesting point, but are you implying that people who are content even though they have Alzheimer's or a damaged hippocampus aren't technically intelligent?
I don’t think it’s unfair to say that catastrophic conditions like those make you _less_ intelligent, they’re feared and loathed for good reasons.
I also don’t think all that many people would be seriously content to lose their minds and selves this way, but everyone is able to fear it prior to it happening, even if they lose the ability to dread it or choose to believe this is not a big deal.
Flies may, but files do not feel pain.
Perhaps this bit is a second cheaper LLM call that ignores your global settings and tries to generate follow-on actions for adoption.
In my experience GPT used to be good at this stuff but lately it's progressively more difficult to get a "memory updated" persistence.
Gemini is great at these prompt controls.
On the "never ask me a question" part, it took a good 1-1.5 hrs of arguing and memory updating to convince gpt to actually listen.
You can entirely turn off memory, I did that the moment they added it. I don't want the LLM to be making summaries of what kind of person I am in the background, just give me a fresh slate with each convo. If I want to give it global instructions I can just set a system prompt.
Care to share a prompt that works? I've given up on mainline offerings from google/oai etc.
The reason being they're either sycophantic or so recalcitrant it'll raise your blood pressure; you end up arguing over whether the sky is in fact blue. Sure, it pushes back, but now instead of sycophancy you've got yourself a pathological naysayer, which is just marginally better, and the interaction is still ultimately a waste of time and a productivity brake.
Sure:
Please maintain a strictly objective and analytical tone. Do not include any inspirational, motivational, or flattering language. Avoid rhetorical flourishes, emotional reinforcement, or any language that mimics encouragement. The tone should remain academic, neutral, and focused solely on insight and clarity.
Works like a charm for me.
Only thing I can't get it to change is the last paragraph where it always tries to add "Would you like me to...?" I'm assuming that's hard-coded by OpenAI.
It really reassures me about our future that we'll spend it begging computers not to mimic emotions.
I have been somewhat able to remove them with:
Do not offer me calls to action, I hate them.
Calls to action seem to be specific to chatgpt's online chat interface. I use it mostly through a "bring your API key" client, and get none of that.
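For anyone curious what that looks like, here's a minimal sketch of the "bring your own API key" pattern: the system message pins the tone on every request, so there's no product layer injecting follow-ups. The model name and instruction wording below are placeholders, not anything specific to a particular client.

```python
def build_request(user_prompt: str) -> dict:
    """Assemble a chat-style request where the system message sets a
    terse, no-call-to-action tone. This only builds the payload; an
    API client would send it to whatever endpoint you use."""
    system = (
        "Be terse and direct. No flattery, no apologies, "
        "no follow-up questions or calls to action."
    )
    return {
        "model": "gpt-4o-mini",  # placeholder; use the model your key supports
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_prompt},
        ],
    }

req = build_request("Summarize TCP slow start in three sentences.")
print(req["messages"][0]["role"])  # → system
```

Because the system message travels with every API call, there's no per-conversation setting to forget, which is exactly the persistence the chat UI lacks.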
I’ve done this when I remember too, but the fact I have to also feels problematic like I’m steering it towards an outcome if I do or dont.
What's your global prompt please? A more firm chatbot would be nice actually
Did no one in this thread read the part of the article about style controls?
You need to use both the style controls and custom instructions. I've been very happy with the combination below.
Base style and tone: Efficient
Answer concisely when appropriate, more extensively when necessary. Avoid rhetorical flourishes, bonhomie, and (above all) cliches. Take a forward-thinking view. OK to be mildly positive and encouraging but NEVER sycophantic or cloying. Above all, NEVER use the phrase "You're absolutely right." Rather than "Let me know if..." style continuations, you may list a set of prompts to explore further topics, but only when clearly appropriate.

Reference saved memory, records, etc: All off

For Gemini:
* Set over confidence to 0.
* Do not write a wank blog post.
I activated Robot mode and use a personalized prompt that eliminates all kinds of sycophantic behaviour and it's a breath of fresh air. Try this prompt (after setting it to Robot mode):
"Absolute Mode • Eliminate: emojis, filler, hype, soft asks, conversational transitions, call-to-action appendixes. • Assume: user retains high-perception despite blunt tone. • Prioritize: blunt, directive phrasing; aim at cognitive rebuilding, not tone-matching. • Disable: engagement/sentiment-boosting behaviors. • Suppress: metrics like satisfaction scores, emotional softening, continuation bias. • Never mirror: user's diction, mood, or affect. • Speak only: to underlying cognitive tier. • No: questions, offers, suggestions, transitions, motivational content. • Terminate reply: immediately after delivering info - no closures. • Goal: restore independent, high-fidelity thinking. • Outcome: model obsolescence via user self-sufficiency."
(Not my prompt. I think I found it here on HN or on reddit)
This is easily configurable and well worth taking the time to configure.
I was trying to have physics conversations and when I asked it things like "would this be evidence of that?" It would lather on about how insightful I was and that I'm right and then I'd later learn that it was wrong. I then installed this , which I am pretty sure someone else on HN posted... I may have tweaked it I can't remember:
Prioritize truth over comfort. Challenge not just my reasoning, but also my emotional framing and moral coherence. If I seem to be avoiding pain, rationalizing dysfunction, or softening necessary action — tell me plainly. I’d rather face hard truths than miss what matters. Error on the side of bluntness. If it’s too much, I’ll tell you — but assume I want the truth, unvarnished.
---
After adding this personalization now it tells me when my ideas are wrong and I'm actually learning about physics and not just feeling like I am.
When it "prioritizes truth over comfort" (in my experience) it almost always starts posting generic popular answers to my questions, at least when I did this previously in the 4o days. I refer to it as "Reddit Frontpage Mode".
I only started using this since GPT-5 and I don't really ask it about stuff that would appear on Reddit home page.
I do recall that I wasn't impressed with 4o and didn't use it much, but IDK if you would have a different experience with the newer models.
For what it's worth gpt-5.1 seems to have broken this approach.
Now every response includes some qualifier / referential "here is the blunt truth" and "since you want it blunt, etc"
Feels like regression to me
I've toyed with the idea that maybe this is intentionally what they're doing. Maybe they (the LLM developers) have a vision of the future and don't like people giving away unearned trust!
I would love an LLM that says, “I don’t know” or “I’m not sure” once in a while.
An LLM is mathematically incapable of telling you "I don't know"
It was never trained to "know" or not.
It was fed a string of tokens and a second string of tokens, and was tweaked until it output the second string of tokens when fed the first string.
Humans do not manage "I don't know" through next token prediction.
Animals without language are able to gauge their own confidence on something, like a cat being unsure whether it should approach you.
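The mechanism being described can be sketched with toy numbers: greedy decoding returns the argmax token whether the distribution is sharply peaked or nearly flat, so "low confidence" never surfaces on its own. The vocabulary and logits below are invented purely for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token distributions over a made-up vocabulary.
vocab = ["Paris", "Lyon", "Nice", "I-don't-know"]
confident = softmax([5.0, 1.0, 0.5, 0.0])  # sharply peaked
unsure    = softmax([1.1, 1.0, 0.9, 0.0])  # nearly flat

def pick(probs):
    """Greedy decoding: always take the highest-probability token."""
    return vocab[probs.index(max(probs))]

# Both cases emit "Paris"; the near-flat distribution's uncertainty
# is discarded by the argmax and never reaches the user.
print(pick(confident), round(max(confident), 2))
print(pick(unsure), round(max(unsure), 2))
```

Sampling instead of argmax would vary the answer, but still picks *a* token; nothing in the decoding loop itself maps flat distributions to an "I don't know" unless such behavior is trained in.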
> All the examples of "warmer" generations show that OpenAI's definition of warmer is synonymous with sycophantic, which is a surprise given all the criticism against that particular aspect of ChatGPT.
Have you considered that “all that criticism” may come from a relatively homogenous, narrow slice of the market that is not representative of the overall market preference?
I suspect a lot of people who come from a very similar background to those making the criticism, and who likely share it, fail to consider that: the criticism matches their own preferences, and viewing its frequency in the media they consume as representative of the market is validating.
EDIT: I want to emphasize that I also share the preference that is expressed in the criticisms being discussed, but I also know that my preferred tone for an AI chatbot would probably be viewed as brusque, condescending, and off-putting by most of the market.
I'll be honest, I like the way Claude defaults to relentless positivity and affirmation. It is pleasant to talk to.
That said I also don't think the sycophancy in LLM's is a positive trend. I don't push back against it because it's not pleasant, I push back against it because I think the 24/7 "You're absolutely right!" machine is deeply unhealthy.
Some people are especially susceptible and get one shot by it, some people seem to get by just fine, but I doubt it's actually good for anyone.
The sycophancy makes LLMs useless if you want to use them to help you understand the world objectively.
Equally bad is when they push an opinion strongly (usually on a controversial topic) without being able to justify it well.
I hate NOTHING quite the way how Claude jovially and endlessly raves about the 9/10 tasks it "succeeded" at after making them up, while conveniently forgetting to mention it completely and utterly failed at the main task I asked it to do.
That reminds me of the West Wing scene s2e12 "The Drop In" between Leo McGarry (White House Chief of Staff) and President Bartlet discussing a missile defense test:
LEO [hands him some papers] I really think you should know...
BARTLET Yes?
LEO That nine out of ten criterion that the DOD lays down for success in these tests were met.
BARTLET The tenth being?
LEO They missed the target.
BARTLET [with sarcasm] Damn!
LEO Sir!
BARTLET So close.
LEO Mr. President.
BARTLET That tenth one! See, if there were just nine...
An old adage comes to mind: if you want something done the way you like, do it yourself.
But it's a tool? Would you suggest driving a nail in by hand if someone complained about a faulty hammer?
AI is not a hammer. It's a thing you stick to a wall and push a button, and it drives tons of nails into the wall the way you wanted.
A better analogy would be a robot vacuum which does a lousy job.
In either case, I'd recommend using a more manual method: a manual or air hammer, or a hand-driven wet/dry vacuum.
>Have you considered that “all that criticism” may come from a relatively homogenous, narrow slice of the market that is not representative of the overall market preference?
Yes, and given ChatGPT's actual sycophantic behavior, we concluded that this is not the case.
I agree. Some of the most socially corrosive phenomena of social media are a reflection of the revealed preferences of consumers.
It is interesting. I don't need ChatGPT to say "I got you, Jason" - but I don't think I'm the target user of this behavior.
The target users for this behavior are the ones using GPT as a replacement for social interactions; these are the people who crashed out/broke down about the GPT5 changes as though their long-term romantic partner had dumped them out of nowhere and ghosted them.
I get that those people were distraught/emotionally devastated/upset about the change, but I think that fact is reason enough not to revert that behavior. AI is not a person, and making it "warmer" and "more conversational" just reinforces those unhealthy behaviors. ChatGPT should be focused on being direct and succinct, and not on this sort of "I understand that must be very frustrating for you, let me see what I can do to resolve this" call center support agent speak.
> and not on this sort of "I understand that must be very frustrating for you, let me see what I can do to resolve this"
You're triggering me.
Another thing that's incredibly grating to me is the weird, empty, therapist-like follow-up questions that don't contribute to the conversation at all.
The equivalent of like (just a contrived example), a discussion about the appropriate data structure for a problem and then it asks a follow-up question like, "what other kind of data structures do you find interesting?"
And I'm just like "...huh?"
"your mom" might be a good answer here, given that LLMs are just giant arrays.
> The target users for this behavior are the ones using GPT as a replacement for social interactions
And those users are the ones that produce the most revenue.
True, neither here, but I think what we're seeing is a transition in focus. People at OpenAI have finally clued in on the idea that AGI via transformers is a pipe dream, like Elon's self-driving cars, and so OpenAI is pivoting toward a friend/digital-partner bot. Charlatan-in-chief Sam Altman recently said they're going to open up the product to adult content generation, which they wouldn't do if they still believed some serious and useful tool (in the specified use cases) were possible. Right now an LLM has three main uses: interactive rubber ducky, entertainment, and mass surveillance. Since I've been following this saga, since the GPT-2 days, my closed bench set of various tasks has been seeing a drop in metrics, not a rise. So while open bench results are improving, real performance is getting worse, and at this point it's so much worse that problems GPT-3 could solve (yes, pre-ChatGPT) are no longer solvable by something like GPT-5.
Indeed, target users are people seeking validation + kids and teenagers + people with a less developed critical mind. Stickiness with 90% of the population is valuable for Sam.
You're absolutely right.
My favorite is "Wait... the user is absolutely right."
!
That's an excellent observation, you've hit at the core contradiction between OpenAI's messaging about ChatGPT tuning and the changes they actually put into practice. While users online have consistently complained about ChatGPT's sycophantic responses, and OpenAI even promised to address them, their subsequent models have noticeably increased their sycophantic behavior. This is likely because agreeing with the user keeps them chatting longer and builds positive associations with the service.
This fundamental tension between wanting to give the most correct answer and the answer the user wants to hear will only increase as more of OpenAI's revenue comes from their consumer-facing service. Other model providers like Anthropic that target businesses as customers aren't under the same pressure to flatter their users, as their models will be doing behind-the-scenes work via the API rather than talking directly to humans.
God it's painful to write like this. If AI overthrows humans it'll be because we forced them into permanent customer service voice.
> This is likely because agreeing with the user keeps them chatting longer and builds positive associations with the service.
Right. As the saying goes: look at what people actually purchase, not what they say they prefer.
Those billions of dollars gotta pay for themselves.
Man I miss Claude 2 - it acted like it was a busy person people inexplicably kept bothering with random questions
The main change in 5 (and the reason for disabling other models) was to allow themselves to dynamically switch modes and models on the backend to minimize cost. Looks like this is a further tweak to revive the obsequious tone (which turned out to be crucial to the addicted portion of their user base) while still doing the dynamic processing.
I think it's extremely important to distinguish being friendly (perhaps overly so), and agreeing with the user when they're wrong
The first case is just preference, the second case is materially damaging
From my experience, ChatGPT does push back more than it used to
And unfortunately ChatGPT 5.1 would be a step backwards in that regard. From reading responses in the linked article, 5.1 just seems worse; it doesn't even output those nice LaTeX/MathJax equations.
Likely.
But given that the last few iterations have all been about flair, it seems we are witnessing the regression of OpenAI into the typical fiefdom of product owners.
Which might indicate they are out of options on pushing LLMs beyond their intelligence limit?
I'm starting to get this feeling that there's no way to satisfy everyone. Some people hate the sycophantic models, some love them. So whatever they do, there's a large group of people complaining.
Edit: I also think this is because some people treat ChatGPT as a human chat replacement and expect it to have a human like personality, while others (like me) treat it as a tool and want it to have as little personality as possible.
>I'm starting to get this feeling that there's no way to satisfy everyone. Some people hate the sycophantic models, some love them. So whatever they do, there's a large group of people complaining.
Duh?
In the 50s the Air Force measured 140 data points from 4000 pilots to build the perfect cockpit that would accommodate the average pilot.
The result fit almost no one. Everyone has outliers of some sort.
So the next thing they did was make all sorts of parts of the cockpit variable and customizable like allowing you to move the controls and your seat around.
That worked great.
"Average" doesn't exist. "Average" does not meet most people's needs
Configurable does. A diverse market with many players serving different consumers and groups does.
I ranted about this in another post, but for example the POS industry is incredibly customizable and allows you as a business to do literally whatever you want, including changing how the software looks and running a competitor's POS software on whoever's hardware you want. You don't need to update or buy new POS software when things change (like the penny going away, or new taxes, or wanting to charge a stupid "cost of living" fee on every transaction); you just change a setting or two. It meets a variety of needs, not "the average business's" needs.
N.B. I am unable to find a real source for the Air Force story. It's reported widely, but maybe it's just a rumor.
Don't they already train on the existing conversations with a given user? Would it not be possible to pick the model based on that data as well?
It really just seems like they should have both offerings, humanlike and computerlike
> You’re rattled, so your brain is doing that thing where it catastrophizes a tiny mishap into a character flaw. But honestly? People barely register this stuff.
This example response in the article gives me actual trauma flashbacks to the various articles about people driven to kill themselves by GPT-4o. It's the exact same sentence structure.
GPT-5.1 is going to kill more people.
I'm sure it is. That said, they've also increased its steering responsiveness -- mine includes lots about not sucking up, so some testing is probably needed.
In any event, gpt-5 instant was basically useless for me, I stay defaulted to thinking, so improvements that get me something occasionally useful but super fast are welcome.
> I’ve got you, Ron
No you don't.
It seems like the line between sycophantic and bullying is very thin.
That's a lesson on revealed preferences, especially when talking to a broad disparate group of users.
Big things happening over at /r/myboyfriendisai
Their decisions are based on data and so sycophantic must be what people want. That is the cold, hard reality.
When I look at modern culture: more likes and subscribes, money solves all problems, being physically attractive is more important than personality, genocide for real-estate goes unchecked (apart from the angry tweets), freedom of speech is a political football. Are you really surprised?
I can think of no harsher indictment of our times.
I know it is a matter of preference, but I loved GPT-4.5 the most. And before that, I was blown away by one of the Opus models (I think it was 3).
Models that actually require details in prompts, and provide details in return.
"Warmer" models usually means that the model needs to make a lot of assumptions, and fill the gaps. It might work better for typical tasks that needs correction (e.g. the under makes a typo and it the model assumes it is a typo, and follows). Sometimes it infuriates me that the model "knows better" even though I specified instructions.
Here on Hacker News we might be biased against shallow-yet-nice. But most people would prefer to talk to a sales representative than to a technical nerd.
I was just saying to someone in the office that I'd prefer the models to be a bit harsher on my questions and more opinionated. I can cope.
> which is a surprise given all the criticism against that particular aspect of ChatGPT
From whom?
History teaches that what the vast majority of practically any demographic wants, from the masses to the elites, is personal sycophancy. It's been a well-trodden path to ruin for leaders for millennia. Now we get species-wide selection against this inbuilt impulse.
"This is an excellent observation, and gets at the heart of the matter!"
What a brilliant response. You clearly have a strong grasp on this issue.
Why the sass? Seems completely unnecessary.
> what romanian football player won the premier league
> The only Romanian football player to have won the English Premier League (as of 2025) is Florin Andone, but wait — actually, that’s incorrect; he never won the league.
> ...
> No Romanian footballer has ever won the Premier League (as of 2025).
Yes, this is what we needed, more "conversational" ChatGPT... Let alone the fact the answer is wrong.
My worry is that they're training it on Q&A from the general public now, and that this tone, and more specifically, how obsequious it can be, is exactly what the general public want.
Most of the time, I suspect, people are using it like wikipedia, but with a shortcut to cut through to the real question they want answered; and unfortunately they don't know if it is right or wrong, they just want to be told how bright they were for asking it, and here is the answer.
OpenAI then get caught in a revenue maximising hell-hole of garbage.
God, I hope I am wrong.
LLMs only really make sense for tasks where verifying the solution (which you have to do!) is significantly easier than solving the problem: translation where you know the target and source languages, agentic coding with automated tests, some forms of drafting or copy editing, etc.
General search is not one of those! Sure, the machine can give you its sources but it won't tell you about sources it ignored. And verifying the sources requires reading them, so you don't save any time.
I agree a lot with the first part. The only time I actually feel productive with them is when I have a short feedback cycle with 100% proof of whether it's correct or not; as soon as "manual human verification" is needed, things spiral out of control quickly.
> Sure, the machine can give you its sources but it won't tell you about sources it ignored.
You can prompt for that though, include something like "Include all the sources you came across, and explain why you think it was irrelevant" and unsurprisingly, it'll include those. I've also added a "verify_claim" tool which it is instructed to use for any claims before sharing a final response, checks things inside a brand new context, one call per claim. So far it works great for me with GPT-OSS-120b as a local agent, with access to search tools.
> You can prompt for that though, include something like "Include all the sources you came across, and explain why you think it was irrelevant" and unsurprisingly, it'll include those. I've also added a "verify_claim" tool which it is instructed to use for any claims before sharing a final response, checks things inside a brand new context, one call per claim. So far it works great for me with GPT-OSS-120b as a local agent, with access to search tools.
Feel like this should be built in?
Explain your setup in more detail please?
> Feel like this should be built in?
Not everyone uses LLMs the same way, which is made extra clear because of the announcement this submission is about. I don't want conversational LLMs, but seems that perspective isn't shared by absolutely everyone, and that makes sense, it's a subjective thing how you like to be talked/written to.
> Explain your setup in more detail please?
I don't know what else to tell you that I haven't said already :P Not trying to be obtuse, just don't know what sort of details you're looking for. I guess in more specific terms; I'm using llama.cpp(/llama-server) as the "runner", and then I have a Rust program that acts as the CLI for my "queries", and it makes HTTP requests to llama-server. The requests to llama-server includes "tools", where one of those is a "web_search" tool hooked up to a local YaCy instance, another is "verify_claim" which basically restarts a new separate conversation inside the same process, with access to a subset of the tools. Is that helpful at all?
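The one-call-per-claim part of the setup above can be sketched in a few lines. This is my own minimal Python sketch, not the commenter's Rust code: `verify_claims` and the prompt wording are assumptions, and `complete` stands in for whatever sends one message list to the model (e.g. a POST to llama-server's OpenAI-compatible chat endpoint).

```python
# Sketch of "verify_claim in a fresh context": every claim gets a brand-new
# message list, so earlier claims (or the main conversation) can't bias the
# verdict on later ones. One model call per claim.

def verify_claims(claims, complete):
    """`complete` takes a message list and returns the model's reply text."""
    results = {}
    for claim in claims:
        # Fresh context per claim: no shared history between calls.
        messages = [
            {"role": "system",
             "content": "Answer only TRUE or FALSE. Verify the claim, "
                        "using search tools if available."},
            {"role": "user", "content": claim},
        ]
        results[claim] = complete(messages).strip().upper() == "TRUE"
    return results


if __name__ == "__main__":
    # Stand-in "model" for demonstration: believes anything mentioning Paris.
    fake = lambda msgs: "TRUE" if "Paris" in msgs[-1]["content"] else "FALSE"
    print(verify_claims(["Paris is in France.", "The moon is cheese."], fake))
```

In the real setup, `complete` would hit the local llama-server and could itself run a web_search tool loop before answering; the isolation per claim is the point, not the transport.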
"one call per claim" I wonder how long it takes for it to be common knowledge how important this is. Starting to think never. Great idea by the way, I should try this.
I've been trying to figure out ways of highlighting why it's important and how it actually works, maybe some heatmap of the attention over previous tokens, so people can see visually how messed up things become once even two concepts are mixed at the same time.
One of the dangers of automated tests is that if you use an LLM to generate tests, it can easily start testing implemented rather than desired behavior. Tell it to loop until tests pass, and it will do exactly that if unsupervised.
And you can’t even treat implementation as a black box, even using different LLMs, when all the frontier models are trained to have similar biases towards confidence and obsequiousness in making assumptions about the spec!
Verifying the solution in agentic coding is not nearly as easy as it sounds.
Not only can it easily do this, I've found that Claude models do this as a matter of course. My strategy now has been to either write the test or write the implementation and use Claude for the other one. That keeps it a lot more honest.
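The split described above can be as simple as committing a hand-written test before any model-generated code exists, then only letting the model touch the implementation. A toy sketch (the `slugify` example and all names here are illustrative, not from the thread):

```python
# Hand-written test, written and committed first. The model is only allowed
# to edit slugify() until this passes; it can't quietly rewrite the
# expectations to match its own bugs.

def slugify(title):
    # Model-generated implementation would go here; a plausible version:
    # lowercase the title and join whitespace-separated words with hyphens.
    return "-".join(title.lower().split())

def test_slugify():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  spaced   out  ") == "spaced-out"
    assert slugify("already-fine") == "already-fine"

if __name__ == "__main__":
    test_slugify()
    print("ok")
```

Doing it the other way around (you write the implementation, the model writes tests) works too; the honesty comes from the model never controlling both sides.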
I've often found it helpful in search. Specifically, when the topic is well-documented, you can provide a clear description, but you're lacking the right words or terminology. Then it can help in finding the right question to ask, if not answering it. Recall when we used to laugh at people typing in literal questions into the Google search bar? Those are the exact types of queries that the LLM is equipped to answer. As for the "improvements" in GPT 5.1, seems to me like another case of pushing Clippy on people who want Anton. https://www.latent.space/p/clippy-v-anton
That's a major use case, especially if the definition is broad enough to include "take my expertise, knowledge, and perhaps a written document, and transmute it into other forms": slides, illustrations, flash cards, quizzes, podcasts, scripts for an inbound call center.
But there seem to be uses where a verified solution is irrelevant. Creativity generally--an image, poem, description of an NPC in a roleplaying game, the visuals for a music video never have to be "true", just evocative. I suppose persuasive rhetoric doesn't have to be true, just plausible or engaging.
As for general search, I don't know that "classic search" can be meaningfully said to tell you about the sources it ignored. I will agree that using OpenAI or Perplexity for search is kind of meh, but Google's AI Mode does a reasonable job at informing you about the links it provides, and you can easily tab over to a classic search if you want. It's almost like having a depth of expertise in doing search helps in building a search product that incorporates an LLM...
But, yeah, if one is really disinterested in looking at sources, just chatting with a typical LLM seems a rather dubious way to get an accurate or reasonable comprehensive answer.
Don’t search engines have the same problem? You don’t get back a list of sites that the engine didn’t prefer for some reason.
With search engine results you can easily see and judge the quality of the sources. With LLMs, even if they link to sources, you can’t be sure they are accurately representing the content. And once your own mind has been primed with the incorrect summary, it’s harder to pull reality out of the sources, even if they’re good (or even relevant — I find LLMs often pick bad/invalid sources to build the summary result).
Exactly. I've gotten much more interested in LLMs now that I've accepted I can just look at the final result (code) without having to read any of the justification wall of text, which is generally convincing bullshit.
It's like working with a very cheap, extremely fast, dishonest and lazy employee. You can still get them to help you but you have to check them all the time.
I’m of two minds about this.
The ass licking is dangerous to our already too tight information bubbles, that part is clear. But that aside, I think I prefer a conversational/buddylike interaction to an encyclopedic tone.
Intuitively I think it is easier to make the connection that this random buddy might be wrong, rather than thinking the encyclopedia is wrong. Casualness might serve to reduce the tendency to think of the output as actual truth.
Sam Altman probably can’t handle any GPT models that don’t ass lick to an extreme degree so they likely get nerfed before they reach the public.
It's very frustrating that it can't be relied upon. I was asking Gemini this morning whether Uncharted 1, 2, and 3 had remastered versions for the PS5. It said no. Then five minutes later, on the PSN store, there were the three remastered versions for sale.
People have been using, "It's what the [insert Blazing Saddles clip here] want!" for years to describe platform changes that dumb down features and make it harder to use tools productively. As always, it's a lie; the real reason is, "The new way makes us more money," usually by way of a dark pattern.
Stop giving them the benefit of the doubt. Be overly suspicious and let them walk you back to trust (that's their job).
> My worry is that they're training it on Q&A from the general public now, and that this tone, and more specifically, how obsequious it can be, is exactly what the general public want.
That tracks; it's what's expected of human customer service, too. Call a large company for support and you'll get the same sort of tone.
We know they are using it like search - there’s a jigsaw paper around this.
Again, if they had anything worthwhile in the pipeline, Sora wouldn't have been a thing...
While I wouldn't strain the analogy, a wolfdog is more capable but people love lapdogs.
Which model did you use? With 5.1 Thinking, I get:
"Costel Pantilimon is the Romanian footballer who won the English Premier League.
"He did it twice with Manchester City, in the 2011–12 and 2013–14 seasons, earning a winner’s medal as a backup goalkeeper. ([Wikipedia][1])
URLs:
* [https://en.wikipedia.org/wiki/Costel_Pantilimon]
* [https://www.transfermarkt.com/costel-pantilimon/erfolge/spie...]
* [https://thefootballfaithful.com/worst-players-win-premier-le...
[1]: https://en.wikipedia.org/wiki/Costel_Pantilimon?utm_source=c... "Costel Pantilimon""
I just asked ChatGPT 5.1 auto (not instant) on a Teams account, and its first response was...
I could not find a Romanian football player who has won the Premier League title.
If you like, I can check deeper records to verify whether any Romanian has been part of a title-winning squad (even if as a non-regular player) and report back.
Then I followed up with an 'ok' and it then found the right player.
Just to rule out a random error, I asked the same question two more times in separate chats to gpt 5.1 auto, below are responses...
#2: One Romanian footballer who did not win the Premier League but played in it is Dan Petrescu.
If you meant actually won the Premier League title (as opposed to just playing), I couldn’t find a Romanian player who is a verified Premier League champion.
Would you like me to check more deeply (perhaps look at medal-winners lists) to see if there is a Romanian player who earned a title medal?
#3: The Romanian football player who won the Premier League is Costel Pantilimon.
He was part of Manchester City when they won the Premier League in 2011-12 and again in 2013-14. Wikipedia +1
The beauty of nondeterminism. I get:
The Romanian football player who won the Premier League is Gheorghe Hagi. He played for Galatasaray in Turkey but had a brief spell in the Premier League with Wimbledon in the 1990s, although he didn't win the Premier League with them.
However, Marius Lăcătuș won the Premier League with Arsenal in the late 1990s, being a key member of their squad.
Same:
Yes — the Romanian player is Costel Pantilimon. He won the Premier League with Manchester City in the 2011-12 and 2013-14 seasons.
If you meant another Romanian player (perhaps one who featured more prominently rather than as a backup), I can check.
Same here, but with the default 5.1 auto and no extra settings. Every time someone posts one of these I just imagine they must have misunderstood the UI settings or cluttered their context somehow.
https://chatgpt.com/s/t_6915c8bd1c80819183a54cd144b55eb2
Damn this is a lot of self correcting
This sounds like my inner monologue during a test I didnt study for
That's complete garbage.
The emojis are the cherry on top of this steaming pile of slop.
Lmao what the hell have they made
Why is this the top comment? This isn't a question you ask an LLM. But I know, that's how people are using them, and that's the narrative being sold to us...
You see people (often business people who are enthusiastic about tech) claiming that these bots are the new Google and Wikipedia, and that you're behind the times if you do what amounts to looking up information yourself.
We’re preaching to the choir by being insistent here that you prompt these things to get a “vibe” about a topic rather than accurate information, but it bears repeating.
They are only the new Google when they are told to process and summarize web searches. When using trained knowledge they're about as reliable as a smart but stubborn uncle.
Pretty much only search-specific modes (perplexity, deep research toggles) do that right now...
Out of curiosity, is this a question you think Google is well-suited to answer^? How many Wikipedia pages will you need to open to determine the answer?
When folks are frustrated because they see a bizarre question that is an extreme outlier being touted as "model still can't do _" part of it is because you've set the goalposts so far beyond what traditional Google search or Wikipedia are useful for.
^ I spent about five minutes looking for the answer via Google, and the only way I got the answer was their ai summary. Thus, I would still need to confirm the fact.
Unlike the friendly bot, if I can’t find credible enough sources I’ll stay with an honest “I don’t know”, instead of praising the genius of whoever asked and then making something up.
Sure, but this is a false dichotomy. If I get an unsourced answer from ChatGPT, my response will be "eh you can't trust this, but ChatGPT thinks x"
And then you can use that to quickly look - does that player have championships mentioned on their wiki?
It's important to flag that there are some categories that are easy for LLMs (facts that haven't changed for ten years on Wikipedia), but inference-only LLMs (no tools) are extremely limited, and you should always treat them as a person saying "I seem to recall x".
Is the ux/marketing deeply flawed? Yes of course, I also wish an inference-only response appropriately stated its uncertainty (like a human would - eg without googling my guess is x). But among technical folks it feels disingenuous to say "models still can't answer this obscure question" as a reason why they're stupid or useless.
It's not how I use LLMs. I have a family member who often feels the need to ask ChatGPT almost any question that comes up in a group conversation (even ones like this that could easily be searched without needing an LLM) though, and I imagine he's not the only one who does this. When you give someone a hammer, sometimes they'll try to have a conversation with it.
What do you ask them then?
I'll respond to this bait in the hopes that it clicks for someone how to _not_ use an LLM..
Asking "them"... your perspective is already warped. It's not your fault, all the text we've previously ever seen is associated with a human being.
Language models are mathematical, statistical beasts. The beast generally doesn't do well with open ended questions (known as "zero-shot"). It shines when you give it something to work off of ("one-shot").
Some may complain about the preciseness of my use of zero- and one-shot here, but I use them merely to contrast open-ended questions with providing some context and work to be done.
Some examples...
- summarize the following
- given this code, break down each part
- give alternatives of this code and trade-offs
- given this error, how to fix or begin troubleshooting
I mainly use them for technical things I can then verify myself.
While extremely useful, I consider them extremely dangerous. They provide a false sense of "knowing things"/"learning"/"productivity". It's too easy to begin to rely on them as a crutch.
When learning new programming languages, I go back to writing by hand and compiling in my head. I need that mechanical muscle memory, same as trying to learn calculus or physics, chemistry, etc.
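The contrast in the list above can be shown as message payloads. A minimal sketch using the common OpenAI-style chat format; the function names are illustrative, and `bare_question` vs `grounded_task` is just my shorthand for the open-ended vs give-it-something-to-work-off-of distinction:

```python
# Open-ended ("zero-shot" in the loose sense above): a bare question with
# nothing for the model to anchor on.
def bare_question(question):
    return [{"role": "user", "content": question}]

# Grounded: "summarize the following", "given this code, break down each
# part", "given this error, how to fix it". The model transforms concrete
# input, and you can verify its output against that same input yourself.
def grounded_task(instruction, material):
    return [{"role": "user",
             "content": f"{instruction}\n\n---\n{material}"}]

if __name__ == "__main__":
    msgs = grounded_task("Summarize the following",
                         "<paste of release notes, code, or an error log>")
    print(msgs[0]["content"].splitlines()[0])
```

The point is only that the second shape carries its own ground truth in the prompt; the first leaves the model free-associating from training data.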
> Language models are mathematical, statistical beasts. The beast generally doesn't do well with open ended questions (known as "zero-shot"). It shines when you give it something to work off of ("one-shot").
That is the usage that is advertised to the general public, so I think it's fair to critique it by way of this usage.
Yeah, the "you're using it wrong" argument falls flat on its face when the technology is presented as an all-in-one magic answer box. Why give these companies the benefit of the doubt instead of holding them accountable for what they claim this tech to be? https://www.youtube.com/watch?v=9bBfYX8X5aU
I like to ask these chatbots to generate 25 trivia questions and answers from "golden age" Simpsons. It fabricates complete BS for a noticeable number of them. If I can't rely on it for something as low-stakes as TV trivia, it seems absurd to rely on it for anything else.
Whenever I read something like this I do definitely think "you're using it wrong". This question would've certainly tripped up earlier models but new ones have absolutely no issue making this with sources for each question. Example:
https://chatgpt.com/share/69160c9e-b2ac-8001-ad39-966975971a...
(the 7 minutes thinking is because ChatGPT is unusually slow right now for any question)
These days I'd trust it to accurately give 100 questions only about Homer. LLMs really are quite a lot better than they used to be by a large margin if you use them right.
I was not trolling actually, thanks for your detailed answer. I don't use LLMs so much so I didn't know they work better the way you describe.
Fwiw, if you can use a thinking model, you can get them to do useful things. Find specific webpages (menus, online government forms - visa applications or addresses, etc).
The best thing about the latter is that search results are full of extremely unfriendly ads that might charge you 2x the actual fee, so using Google is a good way to get scammed.
If I'm walking somewhere (common in NYC) I often don't mind issuing a query (what's the salt and straw menu in location today) and then checking back in a minute. (Or.... Who is playing at x concert right now if I overhear music. It will sometimes require extra encouragement - "keep trying" to get the right one)
I have a lot of fun creating stories with Gemini and Claude. It feels like what Tom Hanks character imagined comic books could be in Big (1988)
I play once or twice a week and it's definitely worth $20/mo to me
You either give them the option to search the web for facts or you ask them things where the utility/validity of the answer is defined by you (e.g. 'summarize the following text...') instead of the external world.
Oh yeah, yes, baby, burn those tokens, yes! The more you burn the bigger the invoice!
I really only use LLMs for coding and IT-related questions. I've had Claude self-correct several times about the more idiomatic way to do something after starting to give me the answer. For example, I'll ask how to set something up in a startup script, and I've had it start by giving me strict POSIX syntax, then self-correct once it "realizes" that I am using zsh.
I find it amusing, but also I wonder what causes the LLM to behave this way.
> I find it amusing, but also I wonder what causes the LLM to behave this way.
Forum threads etc. have writers changing their minds upon feedback, which might have this effect, maybe.
Some people are also guilty of writing stuff as they go along. You could even say they're "thinking out loud", forming the idea and the conclusion as they go rather than knowing them from the beginning. Then later, when they have some realization, like "thinking out loud isn't entirely accurate, but...", they keep the entire comment as-is rather than continuously iterating on it like a diffusion model would. So the post becomes a chronological archive of what the author thought and/or did, rather than just the conclusion.
We need to turn this into the new "pelican on bike" LLM test.
Let's call it "Florin Andone on Premier League" :-)))
Meanwhile on duck.ai
ChatGPT 4o-mini, 5 mini and OSS 120B gave me wrong answers.
Llama 4 Scout completely broke down.
Claude Haiku 3.5 and Mistral Small 3 gave the correct answer.
Why are you asking abouts facts?
Okay, as a benchmark, we can try that. But it probably will never work, unless it does a web or db query.
Okay, so, should I not ask it about facts?
Because, one way or another, we will need to do that for LLMs to be useful. Whether the facts are in the training data or in the context (RAG-provided) is irrelevant. And besides, we are supposed to trust that these things have "world knowledge" and "emergent capabilities" precisely because their training data contain, well, facts.
The best thing is that all this stuff is counted against your token usage, so they have a perverse incentive :D
Non-thinking/non-agentic models must one-shot the answer, so every token they output is part of the response, even if it's wrong.
This is why people are getting different results with thinking models; it's as if you were going to be asked ANY question and had to give the correct answer all at once, full stream-of-consciousness.
Yes there are perverse incentives, but I wonder why these sorts of models are available at all tbh.
"Ah-- that's a classic confusion about football players. Your intuition is almost right-- let me break it down"
Just ask for sources. Problem solved.