Hacker News

Claude 4

2025-05-2216:3420131170www.anthropic.com

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

Show article

Today, we’re introducing the next generation of Claude models: Claude Opus 4 and Claude Sonnet 4, setting new standards for coding, advanced reasoning, and AI agents.

Claude Opus 4 is the world’s best coding model, with sustained performance on complex, long-running tasks and agent workflows. Claude Sonnet 4 is a significant upgrade to Claude Sonnet 3.7, delivering superior coding and reasoning while responding more precisely to your instructions.

Alongside the models, we're also announcing:

Extended thinking with tool use (beta): Both models can use tools—like web search—during extended thinking, allowing Claude to alternate between reasoning and tool use to improve responses.
New model capabilities: Both models can use tools in parallel, follow instructions more precisely, and—when given access to local files by developers—demonstrate significantly improved memory capabilities, extracting and saving key facts to maintain continuity and build tacit knowledge over time.
Claude Code is now generally available: After receiving extensive positive feedback during our research preview, we’re expanding how developers can collaborate with Claude. Claude Code now supports background tasks via GitHub Actions and native integrations with VS Code and JetBrains, displaying edits directly in your files for seamless pair programming.
New API capabilities: We’re releasing four new capabilities on the Anthropic API that enable developers to build more powerful AI agents: the code execution tool, MCP connector, Files API, and the ability to cache prompts for up to one hour.

Claude Opus 4 and Sonnet 4 are hybrid models offering two modes: near-instant responses and extended thinking for deeper reasoning. The Pro, Max, Team, and Enterprise Claude plans include both models and extended thinking, with Sonnet 4 also available to free users. Both models are available on the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI. Pricing remains consistent with previous Opus and Sonnet models: Opus 4 at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15.

Claude 4

Claude Opus 4 is our most powerful model yet and the best coding model in the world, leading on SWE-bench (72.5%) and Terminal-bench (43.2%). It delivers sustained performance on long-running tasks that require focused effort and thousands of steps, with the ability to work continuously for several hours—dramatically outperforming all Sonnet models and significantly expanding what AI agents can accomplish.

Claude Opus 4 excels at coding and complex problem-solving, powering frontier agent products. Cursor calls it state-of-the-art for coding and a leap forward in complex codebase understanding. Replit reports improved precision and dramatic advancements for complex changes across multiple files. Block calls it the first model to boost code quality during editing and debugging in its agent, codename goose, while maintaining full performance and reliability. Rakuten validated its capabilities with a demanding open-source refactor running independently for 7 hours with sustained performance. Cognition notes Opus 4 excels at solving complex challenges that other models can't, successfully handling critical actions that previous models have missed.

Claude Sonnet 4 significantly improves on Sonnet 3.7's industry-leading capabilities, excelling in coding with a state-of-the-art 72.7% on SWE-bench. The model balances performance and efficiency for internal and external use cases, with enhanced steerability for greater control over implementations. While not matching Opus 4 in most domains, it delivers an optimal mix of capability and practicality.

GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot. Manus highlights its improvements in following complex instructions, clear reasoning, and aesthetic outputs. iGent reports Sonnet 4 excels at autonomous multi-feature app development, as well as substantially improved problem-solving and codebase navigation—reducing navigation errors from 20% to near zero. Sourcegraph says the model shows promise as a substantial leap in software development—staying on track longer, understanding problems more deeply, and providing more elegant code quality. Augment Code reports higher success rates, more surgical code edits, and more careful work through complex tasks, making it the top choice for their primary model.

These models advance our customers' AI strategies across the board: Opus 4 pushes boundaries in coding, research, writing, and scientific discovery, while Sonnet 4 brings frontier performance to everyday use cases as an instant upgrade from Sonnet 3.7.

Memory: When given access to local files, Claude Opus 4 records key information to help improve its game play. The notes depicted above are real notes taken by Opus 4 while playing Pokémon.

Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access.

Claude Code

Claude Code, now generally available, brings the power of Claude to more of your development workflow—in the terminal, your favorite IDEs, and running in the background with the Claude Code SDK.

New beta extensions for VS Code and JetBrains integrate Claude Code directly into your IDE. Claude’s proposed edits appear inline in your files, streamlining review and tracking within the familiar editor interface. Simply run Claude Code in your IDE terminal to install.

Beyond the IDE, we're releasing an extensible Claude Code SDK, so you can build your own agents and applications using the same core agent as Claude Code. We're also releasing an example of what's possible with the SDK: Claude Code on GitHub, now in beta. Tag Claude Code on PRs to respond to reviewer feedback, fix CI errors, or modify code. To install, run /install-github-app from within Claude Code.

Getting started

These models are a large step toward the virtual collaborator—maintaining full context, sustaining focus on longer projects, and driving transformational impact. They come with extensive testing and evaluation to minimize risk and maximize safety, including implementing measures for higher AI Safety Levels like ASL-3.

We're excited to see what you'll create. Get started today on Claude, Claude Code, or the platform of your choice.

As always, your feedback helps us improve.

Read the original article

meetpateltech

Karma: 28457

@Hacker__News
@hacker._news

Comments

By minimaxir 2025-05-2217:1610 reply

An important note not mentioned in this announcement is that Claude 4's training cutoff date is March 2025, which is the latest of any recent model. (Gemini 2.5 has a cutoff of January 2025)

https://docs.anthropic.com/en/docs/about-claude/models/overv...

By lxgr 2025-05-2217:4413 reply

With web search being available in all major user-facing LLM products now (and I believe in some APIs as well, sometimes unintentionally), I feel like the exact month of cutoff is becoming less and less relevant, at least in my personal experience.

The models I'm regularly using are usually smart enough to figure out that they should be pulling in new information for a given topic.

By bredren 2025-05-2218:1310 reply

It still matters for software packages. Particularly python packages that have to do with programming with AI!

They are evolving quickly, with deprecation and updated documentation. Having to correct for this in system prompts is a pain.

It would be great if the models were updating portions of their content more recently than others.

For the tailwind example in parent-sibling comment, should absolutely be as up to date as possible, whereas the history of the US civil war can probably be updated less frequently.

By rafram 2025-05-2218:582 reply

> the history of the US civil war can probably be updated less frequently.

It's already missed out on two issues of Civil War History: https://muse.jhu.edu/journal/42

Contrary to the prevailing belief in tech circles, there's a lot in history/social science that we don't know and are still figuring out. It's not IEEE Transactions on Pattern Analysis and Machine Intelligence (four issues since March), but it's not nothing.

By bredren 2025-05-2220:111 reply

Let us dispel with the notion that I do not appreciate Civil War history. Ashokan Farewell is the only song I can play from memory on violin.

By taydotis 2025-05-2311:591 reply

this unlocked memories in me that were long forgotten. Ashokan Farewell !!

By nickpeterson 2025-05-2316:03

I didn’t recognize it by name and thought, “I wonder if that’s the theme for pbs the civil war…”, imagine my satisfaction after pulling it up ;)

By xp84 2025-05-2219:464 reply

I started reading the first article in one of those issues only to realize it was just a preview of something very paywalled. Why does Johns Hopkins need money so badly that it has to hold historical knowledge hostage? :(

By evanelias 2025-05-2221:35

Johns Hopkins is not the publisher of this journal and does not hold copyright for this journal. Why are you blaming them?

The website linked above is just a way to read journals online, hosted by Johns Hopkins. As it states, "Most of our users get access to content on Project MUSE through their library or institution. For individuals who are not affiliated with a library or institution, we provide options for you to purchase Project MUSE content and subscriptions for a selection of Project MUSE journals."

By ordersofmag 2025-05-2221:132 reply

The journal appears to be published by an office with 7 FTE's which presumably is funded by the money raised by presence of the paywall and sales of their journals and books. Fully-loaded costs for 7 folks is on the order of $750k/year. https://www.kentstateuniversitypress.com/

Someone has to foot that bill. Open-access publishing implies the authors are paying the cost of publication and its popularity in STEM reflects an availability of money (especially grant funds) to cover those author page charges that is not mirrored in the social sciences and humanities.

Unrelatedly given recent changes in federal funding Johns Hopkins is probably feeling like it could use a little extra cash (losing $800 million in USAID funding, overhead rates potential dropping to existential crisis levels, etc...)

By arghwhat 2025-05-235:521 reply

> Open-access publishing implies the authors are paying the cost of publication and its popularity in STEM reflects an availability of money

No it implied the journal not double-dipping by extorting both the author and the reader, while not actually performing any valuable task whatsoever for that money.

By drdeca 2025-05-2318:42

> while not actually performing any valuable task whatsoever for that money.

Like with complaints about landlords not producing any value, I think this is an overstatement? Rather, in both cases, the income they bring in is typically substantially larger than what they contribute, due to economic rent, but they do both typically produce some non-zero value.

By vasco 2025-05-235:021 reply

They could pay them from the $13B endowment they have.

By evanelias 2025-05-2314:46

Johns Hopkins University has an endowment of $13B, but as I already noted above, this journal has no direct affiliation with Johns Hopkins whatsoever so the size of Johns Hopkins' endowment is completely irrelevant here. They just host a website which allows online reading of academic journals.

This particular journal is published by Kent State University, which has an endowment of less than $200 million.

By ChadNauseam 2025-05-2221:111 reply

Isn’t john hopkins a university? I feel like holding knowledge hostage is their entire business model.

By cempaka 2025-05-233:081 reply

Pretty funny to see people posting about "holding knowledge hostage" on a thread about a new LLM version from a company which 100% intends to make that its business model.

By PeterStuer 2025-05-237:221 reply

I'd be ok with a $20 montly sub for access to all the world's academic journals.

By mschuster91 2025-05-238:002 reply

So, yet another permanent rent seeking scheme? That's bad enough for Netflix, D+, YouTube Premium, Spotify and god knows what else that bleeds money every month out of you.

But science? That's something that IMHO should be paid for with tax money, so that it is accessible for everyone without consideration of one's ability to have money that can be bled.

By slantview 2025-05-2317:28

This is exactly the problem that pay-per-use LLM access is causing. It's gating the people who need the information the most and causing a divide between the "haves" and "have nots" but at a much larger potential for dividing us.

Sure for me, $20/mo is fine, in fact, I work on AI systems, so I can mostly just use my employer's keys for stuff. But what about the rest of the world where $20/mo is a huge amount of money? We are going to burn through the environment and the most disenfranchised amongst us will suffer the most for it.

By PeterStuer 2025-05-239:521 reply

The situation we had/have is arguably the result of the 'tax' money system. Governments lavishly funding bloated university administrations that approve equally lavish multi million access deals with a select few publishers for students and staff, while the 'general public' basically had no access at all.

By spookie 2025-05-2311:42

The publishers are the problem. Your solution asks the publisher to extort less money.

Aka not happening.

By pjmlp 2025-05-237:271 reply

Given that I am still coding against Java 17, C# 7, C++17 and such at most work projects, and more recent versions are still the exception, it is quite reasonable.

Few are on jobs where v-latest is always an option.

By jinushaun 2025-05-2311:541 reply

It’s not about the language. I get bit when they recommend old libraries or hallucinate non-existent ones.

By pjmlp 2025-05-2312:02

Hallucination is indeed a problem.

As for the libraries, for using more modern libraries, usually it also requires more recent language versions.

By brylie 2025-05-2310:263 reply

I've had good success with the Context7 model context protocol tool, which allows code agents, like GitHub Copilot, to look up the latest relevant version of library documentation including code snippets: https://context7.com/

By frodo999 2025-06-0517:26

We just launched an alternative called Docfork that just uses 1 API call and wraps up the request (Context7 generally uses 2) since speed was a big priority for us: https://docfork.com

By diggan 2025-05-2311:10

I wonder how necessary that is. I've noticed that while Codex doesn't have any fancy tools like that (as it doesn't have internet access), it instead finds the source of whatever library you pulled in, so in Rust for example it's aware (or finds out) where the source was pulled down, and greps those files to figure out the API on the fly. Seems to work well enough and also works whatever library, private or not, updated 1 minute ago or not.

By roflyear 2025-05-2219:301 reply

It matters even with recent cutoffs, these models have no idea when to use a package or not (if it's no longer maintained, etc)

You can fix this by first figuring out what packages to use or providing your package list, tho.

By diggan 2025-05-2220:34

> these models have no idea when to use a package or not (if it's no longer maintained, etc)

They have ideas about what you tell them to have ideas about. In this case, when to use a package or not, differs a lot by person, organization or even project, so makes sense they wouldn't be heavily biased one way or another.

Personally I'd look at architecture of the package code before I'd look at when the last change was/how often it was updated, and if it was years since last change or yesterday have little bearing (usually) when deciding to use it, so I wouldn't want my LLM assistant to value it differently.

By andrepd 2025-05-2311:131 reply

The fact that things from March are already deprecated is insane.

By krzyk 2025-05-2313:42

sounds like npm and general js ecosystem

By jordanbeiber 2025-05-236:10

Cursor have a nice ”docs” feature for this, that have saved me from battles with constant version reversing actions from our dear LLM overlords.

By MollyRealized 2025-05-2316:32

> whereas the history of the US civil war can probably be updated less frequently.

Depends on which one you're talking about.

By alasano 2025-05-2219:11

The context7 MCP helps with this but I agree.

By paulddraper 2025-05-2220:323 reply

How often are base level libraries/frameworks changing in incomparable ways?

By jinushaun 2025-05-2311:581 reply

In the JavaScript world, very frequently. If latest is 2.8 and I’m coding against 2.1, I don’t want answers using 1.6. This happened enough that I now always specify versions in my prompt.

By paulddraper 2025-05-2315:521 reply

Geez

By alwa 2025-05-2317:201 reply

Normally I’d think of “geez” as a low-effort reply, but my reaction is exactly the same…

What on earth is the maintenance load like in that world these days? I wonder, do JavaScript people find LLMs helpful in migrating stuff to keep up?

By zchrykng 2025-05-2913:57

The better solution would be the JavaScript people stop reinventing the world every few months.

By weq 2025-05-234:18

The more popular a library is, the more times its updated every year, the more it will suffer this fate. You always have refine prompts with specific versions and specific ways of doing things, each will be different on your use case.

By hnlmorg 2025-05-2221:14

That depends on the language and domain.

MCP itself isn’t even a year old.

By toomuchtodo 2025-05-2219:021 reply

Does repo/package specific MCP solve for this at all?

By aziaziazi 2025-05-2219:161 reply

Kind of but not in the same way: the MCP option will increase the discussion context, the training option does not. Armchair expert so confirmation would be appreciated.

By toomuchtodo 2025-05-2219:271 reply

Same, I'm curious what it looks like to incrementally or micro train against, if at all possible, frequently changing data sources (repos, Wikipedia/news/current events, etc).

By ironchef 2025-05-232:52

Folks often use things like LoRAs for that.

By jacob019 2025-05-2217:55

Valid. I suppose the most annoying thing related to the cutoffs, is the model's knowledge of library APIs, especially when there are breaking changes. Even when they have some knowledge of the most recent version, they tend to default to whatever they have seen the most in training, which is typically older code. I suspect the frontier labs have all been working to mitigate this. I'm just super stoked, been waiting for this one to drop.

By drogus 2025-05-2221:26

In my experience it really depends on the situation. For stable APIs that have been around for years, sure, it doesn't really matter that much. But if you try to use a library that had significant changes after the cutoff, the models tend to do things the old way, even if you provide a link to examples with new code.

By myfonj 2025-05-2312:54

For the recent resources it might matter: unless the training data are curated meticulously, they may be "spoiled" by the output of other LLM, or even the previous version of the one that is being trained. That's something what is generally considered dangerous, because it could potentially produce unintentional echo-chamber or even somewhat "incestuously degenerated" new model.

By jgalt212 2025-05-2314:06

> The models I'm regularly using are usually smart enough to figure out that they should be pulling in new information for a given topic.

Fair enough, but information encoded in the model is return in milliseconds, information that needs to be scraped is returned in 10s of seconds.

By GardenLetter27 2025-05-2219:331 reply

I've had issues with Godot and Rustls - where it gives code for some ancient version of the API.

By Aeolun 2025-05-2223:121 reply

> some ancient version of the API

One and a half years old shudders

By Tostino 2025-05-230:44

When everything it is trying to use is deprecated, yeah it matters.

By iLoveOncall 2025-05-2222:441 reply

Web search isn't desirable or even an option in a lot of use cases that involve GenAI.

It seems people have turned GenAI into coding assistants only and forget that they can actually be used for other projects too.

By lanstin 2025-05-2317:401 reply

That's because between the two approaches "explain me this thing" or "write code to demonstrate this thing" the LLMs are much more useful on the second path. I can ask it to calculate some third derivatives, or I can ask it to write Mathematica notebook to calculate the same derivatives, and the latter is generally correct and extremely useful as is - the former requires me to scrutinize each line of logic and calculation very carefully.

It's like https://www.youtube.com/watch?v=zZr54G7ec7A where Prof. Tao uses claude to generate Lean4 proofs (which are then verifiable by machine). Great progress, very useful. While the LLM only approachs are still lacking utility for the top minds: https://mathstodon.xyz/@tao/113132502735585408

By iLoveOncall 2025-05-2319:541 reply

You have a narrow imagination. I'm talking about using GenAI for non-CS related applications, like it was advertised a year or so ago.

By lanstin 2025-05-245:57

Lacking a rigorous way to verify truth, I would be pretty wary of using it for truly important things without a human to validate.

And math research is a non-CS application, for the pedants :)

By guywithahat 2025-05-2219:041 reply

I was thinking that too, grok can comment on things that have only just broke out hours earlier, cutoff dates don't seem to matter much

By DonHopkins 2025-05-237:211 reply

Yeah, it seems pretty up-to-date with Elon's latest White Genocide and Holocaust Denial conspiracy theories, but it's so heavy handed about bringing them up out of the blue and pushing them in the middle of discussions about the Zod 4 and Svelte 5 and Tailwind 4 that I think those topics are coming from its prompts, not its training.

By drilbo 2025-05-2316:051 reply

while this is obviously a very damning example, tbf it does seem to be an extreme outlier.

By DonHopkins 2025-05-2317:04

Well Elon Musk is definitely an extremist, and he's certainly a bald faced out liar, and he's obviously the tin pot dictator of the prompt. So you have a great point.

Poor Grok is stuck in the middle of denying the Jewish Holocaust on one hand, while fabricating the White Genocide on the other hand.

No wonder it's so confused and demented, and wants to inject its cognitive dissonance into every conversation.

By Kostic 2025-05-238:00

It's relevant from an engineering perspective. They have a way to develop a new model in months now.

By dzhiurgis 2025-05-231:31

Ditto. Twitter's Grok is especially good at this.

By lobochrome 2025-05-237:46

It knows uv now

By tzury 2025-05-233:25

web search is an immediate limited operation training is a petabytes long term operation

By BeetleB 2025-05-2220:56

Web search is costlier.

By tristanb 2025-05-2218:561 reply

Nice - it might know about Svelte 5 finally...

By brulard 2025-05-2219:442 reply

It knows about Svelte 5 for some time, but it particularly likes to mix it with Svelte 4 in very weird and broken ways.

By rxtexit 2025-05-2311:42

I have experienced this for various libraries. I think it helps to paste in a package.json in the prompt.

All the models seem to struggle with React three fiber like this. Mixing and matching versions that don't make sense. I can see this being a tough problem given the nature of these models and the training data.

I am going to also try to start giving it a better skeleton to start with and stick to the particular imports when faced with this issue.

My very first prompt with claude 4 was for R3F and it imported a depreciated component as usual.

We can't expect the model to read our minds.

By DonHopkins 2025-05-237:23

Or worse yet, React!

By liorn 2025-05-2217:554 reply

I asked it about Tailwind CSS (since I had problems with Claude not aware of Tailwind 4):

> Which version of tailwind css do you know?

> I have knowledge of Tailwind CSS up to version 3.4, which was the latest stable version as of my knowledge cutoff in January 2025.

By threeducks 2025-05-2218:292 reply

> Which version of tailwind css do you know?

LLMs can not reliably tell whether they know or don't know something. If they did, we would not have to deal with hallucinations.

By redman25 2025-05-2313:331 reply

They can if they've been post trained on what they know and don't know. The LLM can first been given questions to test its knowledge and if the model returns a wrong answer, it can be given a new training example with an "I don't know" response.

By dingnuts 2025-05-2313:54

Oh that's a great idea, just do that for every question the LLM doesn't know the answer to!

That's.. how many questions? Maybe if one model generates all possible questions then

By nicce 2025-05-2219:421 reply

We should use the correct term: to not have to deal with bullshit.

By dudeinjapan 2025-05-238:561 reply

I think “confabulation” is the best term.

“Hallucination” is seeing/saying something that a sober person clearly knows is not supposed to be there, e.g. “The Vice President under Nixon was Oscar the Grouch.”

Harry Frankfurt defines “bullshitting” as lying to persuade without regard to the truth. (A certain current US president does this profusely and masterfully.)

“Confabulation” is filling the unknown parts of a statement or story with bits that sound as-if they could be true, i.e. they make sense within the context, but are not actually true. People with dementia (e.g. a certain previous US president) will do this unintentionally. Whereas the bullshitter generally knows their bullshit to be false and is intentionally deceiving out of self-interest, confabulation (like hallucination) can simply be the consequence of impaired mental capacity.

By nicce 2025-05-2311:071 reply

I think the Frankfurt definition is a bit off.

E.g. from the paper ChatGPT is bullshit [1],

> Frankfurt understands bullshit to be characterized not by an intent to deceive but instead by a reckless disregard for the truth.

That is different than defining "bullshitting" as lying. I agree that "confabulation" could otherwise be more accurate. But with previous definition they are kinda synonyms? And "reckless disregard for the truth" may hit closer. The paper has more direct quotes about the term.

[1] https://link.springer.com/article/10.1007/s10676-024-09775-5

By dudeinjapan 2025-05-244:47

You're right. It's "intent to persuade with a reckless disregard for the truth." But even by this definition, LLMs are not (as far as we know) trying to persuade us of anything, beyond the extent that persuasion is a natural/structural feature of all language.

By SparkyMcUnicorn 2025-05-2218:022 reply

Interesting. It's claiming different knowledge cutoff dates depending on the question asked.

"Who is president?" gives a "April 2024" date.

By ethbr1 2025-05-2218:182 reply

Question for HN: how are content timestamps encoded during training?

By cma 2025-05-2221:291 reply

Claude 4's system prompt was published and contains:

"Claude’s reliable knowledge cutoff date - the date past which it cannot answer questions reliably - is the end of January 2025. It answers all questions the way a highly informed individual in January 2025 would if they were talking to someone from {{currentDateTime}}, "

https://docs.anthropic.com/en/release-notes/system-prompts#m...

By polynomial 2025-05-2323:19

I thought best guesses were that Claude's system prompt ran to tens of thousands of tokens, with figures like 30,000 tokens being bandied about.

But the documentation page linked here doesn't bear that out. In fact the Claude 3.7 system prompt on this page clocks in at significantly less than 4,000 tokens.

By tough 2025-05-2218:342 reply

they arent.

a model learns words or tokens more pedantically but has no sense of time nor cant track dates

By svachalek 2025-05-2218:49

Yup. Either the system prompt includes a date it can parrot, or it doesn't and the LLM will just hallucinate one as needed. Looks like it's the latter case here.

By manmal 2025-05-2218:512 reply

Technically they don’t, but OpenAI must be injecting the current date and time into the system prompt, and Gemini just does a web search for the time when asked.

By tough 2025-05-2219:021 reply

right but that's system prompting / in context

not really -trained- into the weights.

the point is you can't ask a model what's his training cut off date and expect a reliable answer from the weights itself.

closer you could do is have a bench with -timed- questions that could only know if had been trained for that, and you'd had to deal with hallucinations vs correctness etc

just not what llm's are made for, RAG solves this tho

By stingraycharles 2025-05-230:401 reply

What would the benefits be of actual time concepts being trained into the weights? Isn’t just tokenizing the dates and including those as normal enough to yield benefits?

E.g. it probably has a pretty good understanding between “second world war” and the time period it lasted. Or are you talking about the relation between “current wall clock time” and questions being asked?

By tough 2025-05-237:49

there's actually some work on training transformer models on time series data which is quite interesting (for prediction purposes)

see google TimesFM: https://github.com/google-research/timesfm

what i mean i guess is llms can -reason- linguistically about time manipulating language, but can't really experience it. a bit like physics. thats why they do bad on exercises/questions about physics/logic that their training corpus might not have seen.

By tough 2025-05-2219:031 reply

OpenAI injects a lot of stuff, your name, sub status, recent threads, memory, etc

sometimes its interesting to peek up under the network tab on dev tools

By Tokumei-no-hito 2025-05-2220:262 reply

strange they would do that client side

By diggan 2025-05-2220:36

Different teams who work backend/frontend surely, and the people experimenting on the prompts for whatever reason wanna go through the frontend pipeline.

By tough 2025-05-2221:13

its just like extra metadata associated with your account not much else

By dawnerd 2025-05-2219:14

I did the same recently with copilot and it of course lied and said it knew about v4. Hard to trust any of them.

By PeterStuer 2025-05-237:28

Did you try giving it the relevant parts of the tailwind 4 documentation in the prompt context?

By Phelinofist 2025-05-2220:202 reply

Why can't it be trained "continuously"?

By cma 2025-05-2221:311 reply

Catastrophic forgetting

https://en.wikipedia.org/wiki/Catastrophic_interference

By DonHopkins 2025-05-237:42

Fascinating, thank for that link! I was reading the sub-sections of the Proposed Solutions / Rehearsal section, thinking it seemed a lot like dreaming, then got to the Spontaneous replay sub-section:

>Spontaneous replay

>The insights into the mechanisms of memory consolidation during the sleep processes in human and animal brain led to other biologically inspired approaches. While declarative memories are in the classical picture consolidated by hippocampo-neocortical dialog during NREM phase of sleep (see above), some types of procedural memories were suggested not to rely on the hippocampus and involve REM phase of the sleep (e.g.,[22] but see[23] for the complexity of the topic). This inspired models where internal representations (memories) created by previous learning are spontaneously replayed during sleep-like periods in the network itself[24][25] (i.e. without help of secondary network performed by generative replay approaches mentioned above).

The Electric Prunes - I Had Too Much To Dream (Last Night):

https://www.youtube.com/watch?v=amQtlkdQSfQ

By AlexCoventry 2025-05-230:27

It's really not necessary, with retrieval-augmented generation. It can be trained to just check what the latest version is.

By m3kw9 2025-05-2217:312 reply

Even that, we don’t know what got updated and what didn’t. Can we assume everything that can be updated is updated?

By diggan 2025-05-2217:371 reply

> Can we assume everything that can be updated is updated?

What does that even mean? Of course an LLM doesn't know everything, so it we wouldn't be able to assume everything got updated either. At best, if they shared the datasets they used (which they won't, because most likely it was acquired illegally), you could make some guesses what they tried to update.

By therein 2025-05-2217:551 reply

> What does that even mean?

I think it is clear what he meant and it is a legitimate question.

If you took a 6 year old and told him about the things that happened in the last year and sent him off to work, did he integrate the last year's knowledge? Did he even believe it or find it true? If that information was conflicting what he knew before, how do we know that the most recent thing he is told he will take as the new information? Will he continue parroting what he knew before this last upload? These are legitimate questions we have about our black box of statistics.

By aziaziazi 2025-05-2219:351 reply

Interesting, I read GGP as:

If they stopped learning (=including) at march 31 and something popup on the internet on march 30 (lib update, new Nobel, whatever) there’s many chances it got scrapped because they probably don’t scrap everything in one day (do they ?).

That isn’t mutually exclusive with your answer I guess.

edit: thanks adolph to point out the typo.

By adolph 2025-05-2219:51

Maybe I'm old school but isn't the date the last date for inclusion in the training corpus and not the date "they stopped training"?

By simlevesque 2025-05-2217:343 reply

You might be able to ask it what it knows.

By minimaxir 2025-05-2217:411 reply

So something's odd there. I asked it "Who won Super Bowl LIX and what was the winning score?" which was in February and the model replied "I don't have information about Super Bowl LIX (59) because it hasn't been played yet. Super Bowl LIX is scheduled to take place in February 2025.".

By ldoughty 2025-05-2217:532 reply

With LLMs, if you repeat something often enough, it becomes true.

I imagine there's a lot more data pointing to the super bowl being upcoming, then the super bowl concluding with the score.

Gonna be scary when bot farms are paid to make massive amounts of politically motivated false content (specifically) targeting future LLMs training

By gosub100 2025-05-2218:141 reply

A lot of people are forecasting the death of the Internet as we know it. The financial incentives are too high and the barrier of entry is too low. If you can build bots that maybe only generate a fraction of a dollar per day (referring people to businesses, posting spam for elections, poisoning data collection/web crawlers), someone in a poor country will do it. Then, the bots themselves have value which creates a market for specialists in fake profile farming.

I'll go a step further and say this is not a problem but a boon to tech companies. Then they can sell you a "premium service" to a walled garden of only verified humans or bot-filtered content. The rest of the Internet will suck and nobody will have incentive to fix it.

By birn559 2025-05-2218:471 reply

I believe identity providers will become even more important in the future as a consequence and that there will be an arm race (hopefully) ending with most people providing them some kind of official id.

By gosub100 2025-05-2219:021 reply

It might slow them down, but integration of the government into online accounts will have its own set of consequences. Some good, of course. But can chill free speech and become a huge liability for whoever collects and verifies the IDs. One hack (say of the government ID database) would spoil the whole system.

By birn559 2025-05-237:10

I agree, this would have very bad consequences regarding free speech and democracy. Next step after that would be a reestablishing of pseudonymously platforms, going full circle.

By dr-smooth 2025-05-2218:20

I'm sure it's already happening.

By krferriter 2025-05-2218:24

Why would you trust it to accurately say what it knows? It's all statistical processes. There's no "but actually for this question give me only a correct answer" toggle.

By retrofuturism 2025-05-2217:38

When I try Claude Sonnet 4 via web:

https://claude.ai/share/59818e6c-804b-4597-826a-c0ca2eccdc46

>This is a topic that would have developed after my knowledge cutoff of January 2025, so I should search for information [...]

By indigodaddy 2025-05-2218:531 reply

Should we not necessarily assume that it would have some FastHTML training with that March 2025 cutoff date? I'd hope so but I guess it's more likely that it still hasn't trained on FastHTML?

By jph00 2025-05-2222:24

Claude 4 actually knows FastHTML pretty well! :D It managed to one-shot most basic tasks I sent its way, although it makes a lot of standard minor n00b mistakes that make its code a bit longer and more complex than needed.

I've nearly finished writing a short guide which, when added to a prompt, gives quite idiomatic FastHTML code.

By VectorLock 2025-05-231:33

I'm starting to wonder if having more recent cut-off dates is more a bug than a feature.

By dvfjsdhgfv 2025-05-2218:193 reply

One thing I'm 100% is that a cut off date doesn't exist for any large model, or rather there is no single date since it's practically almost impossible to achieve that.

By sib 2025-05-2316:221 reply

But I think the general meaning of a cutoff date, D, is:

The model includes nothing AFTER date D

and not

The model includes everything ON OR BEFORE date D

Right? Definitionally, the model can't include anything that happened after training stopped.

By dvfjsdhgfv 2025-05-2321:05

That's correct. However, it is almost meaningless in practice as it might as well mean that, say, 99,99% of the content is 2 years old and older, and only 0,01 was trained just before that date. So if you need functionality that's dependent on new information, you have to test it for each particular component you need.

Unfortunately I work with new APIs all the time and the cutoff date is of no much use.

By koolba 2025-05-2218:233 reply

Indeed. It’s not possible stop the world and snapshot the entire internet in a single day.

Or is it?

By tough 2025-05-2218:35

you would have an append only incremental backup snapshot of the world

By gf000 2025-05-2316:12

You can trivially maximal bound it, though. If the training finished today, then today is a cutoff date.

By dragonwriter 2025-05-2316:25

That's... not what a cutoff date means. Cutoff date is an upper bound, not a promise that the model is trained on every piece of information set in a fixed form before that date.

By tonyhart7 2025-05-2218:41

its not a definitive "date" you cut off information, but more a "recent" material you can feed, training takes times

if you waiting for a new information, of course you are not going ever to train

By cma 2025-05-2218:06

When I asked the model it told me January (for sonnet 4). Doesn't it normally get that in its system prompt?

By SparkyMcUnicorn 2025-05-2217:511 reply

Although I believe it, I wish there was some observability into what data is included here.

Both Sonnet and Opus 4 say Joe Biden is president and claim their knowledge cutoff is "April 2024".

By Tossrock 2025-05-2218:321 reply

Are you sure you're using 4? Mine says January 2025: https://claude.ai/share/9d544e4c-253e-4d61-bdad-b5dd1c2f1a63

By SparkyMcUnicorn 2025-05-2223:301 reply

100% sure. Tested in the Anthropic workbench[0] to double check and got the same result.

The web interface has a prompt that defines a cutoff date and who's president[1].

[0] https://console.anthropic.com/workbench

[1] https://docs.anthropic.com/en/release-notes/system-prompts#c...

By mikeshi42 2025-05-233:472 reply

Can confirm the workbench does with `claude-sonnet-4-20250514` returns Biden (with a claimed April 2024 cutoff date) while Chat returns Trump (as encoded in the system prompt, with no cutoff date mention). Interesting

By jaapz 2025-05-2311:081 reply

They encoded that trump is president in the system prompt? That's afwully specific information to put in the system prompt

By rafram 2025-05-2315:55

Most of their training data says that Biden is president, because it was created/scraped pre-2025. AI models have no concept of temporal context when training on a source.

People use "who's the president?" as a cutoff check (sort of like paramedics do when triaging a potential head injury patient!), so they put it into the prompt. If people switched to asking who the CEO of Costco is, maybe they'd put that in the prompt too.

By jaakl 2025-05-234:27

Some models do have US 2025 president election results explicitly given in system prompt. To fool all who use it for cutoff check.

By jasonthorsness 2025-05-2216:457 reply

“GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot.”

Maybe this model will push the “Assign to CoPilot” closer to the dream of having package upgrades and other mostly-mechanical stuff handled automatically. This tech could lead to a huge revival of older projects as the maintenance burden falls.

By rco8786 2025-05-2217:331 reply

It could be! But that's also what people said about all the models before it!

By kmacdough 2025-05-2219:291 reply

And they might all be right!

> This tech could lead to...

I don't think he's saying this is the version that will suddenly trigger a Renaissance. Rather, it's one solid step that makes the path ever more promising.

Sure, everyone gets a bit overexcited each release until they find the bounds. But the bounds are expanding, and the need for careful prompt engineering is diminishing. Ever since 3.7, Claude has been a regular part of my process for the mundane. And so far 4.0 seems to take less fighting for me.

A good question would be when can AI take a basic prompt, gather its own requirements and build meaningful PRs off basic prompt. I suspect it's still at least a couple of paradigm shifts away. But those seem to be coming every year or faster.

By sagarpatil 2025-05-233:311 reply

Did you not see the live stream? They took a feature request for excalidraw (table support) and Claude 4 worked on it for 90 minutes and the PR was working as expected. I’m not sure if they were using sonnet or opus.

By andrepd 2025-05-2311:151 reply

Pre-prepared demos don't impress me.

By spiderfarmer 2025-05-2321:512 reply

By that logic sport athletes don’t impress you. Movies don’t impress you. Theater doesn’t impress you. Your date won’t impress you. Becoming a parent won’t impress you.

Most things in life take years of preparation.

By xaoz 2025-05-241:51

The issue with preprepared demos is that you can carefully curate the scenario and make choices that you know are likely to show the best outcome, out of a wide range of possibilities. If you know your model (or VC product demo etc) performs poorly under certain conditions, you simply avoid them. This is a reason to be somewhat skeptical about demos.

By rco8786 2025-05-2412:50

No, the logic would be that a football player's 40 time doesn't impress, because you want to see the on field performance. Or a movie trailer doesn't impress, because it's only meant to get you to watch the movie which might be trash. A first date won't impress you, it will take multiple dates to understand and love a human.

etc.

By max_on_hn 2025-05-2218:492 reply

I am incredibly eager to see what affordable coding agents can do for open source :) in fact, I should really be giving away CheepCode[0] credits to open source projects. Pending any sort of formal structure, if you see this comment and want free coding agent runs, email me and I’ll set you up!

[0] My headless coding agents product, similar to “assign to copilot” but works from your task board (Linear, Jira, etc) on multiple tasks in parallel. So far simple/routine features are already quite successful. In general the better the tests, the better the resulting code (and yes, it can and does write its own tests).

By troupo 2025-05-237:481 reply

> I am incredibly eager to see what affordable coding agents can do for open source :)

Oh, we know exactly what they will do: they will drive devs insane: https://www.reddit.com/r/ExperiencedDevs/comments/1krttqo/my...

By losvedir 2025-05-2319:321 reply

I dunno, looking through those issues I'd be more annoyed by all the randos grandstanding in my PRs.

By troupo 2025-05-2320:15

And not all the "fix this - i fixed - no you didn't - here's the fix — there's no fix" back and forth with the AI?

There's very little grandstanding in the comments. They are all very tame, all things considered.

By dr_dshiv 2025-05-237:251 reply

Especially since the EU just made open source contributors liable for cybersecurity (Cyber Resilience Act). Just let AI contribute and ur good

By hn111 2025-05-237:572 reply

Didn’t they make an exception for open-source projects? https://opensource.org/blog/the-european-regulators-listened...

By dr_dshiv 2025-05-2316:081 reply

“Anyone opensourcing anything while in the course of ‘commercial activity’ will be fully liable. Effectively they rugpulled the Apache2 / MIT licenses... all opensource released by small businesses is fucked. where the was no red tape now there is infinite liability”

This is my current understanding, from a friend not a lawyer. Would appreciate any insight from folks here.

By spiderfarmer 2025-05-2321:531 reply

So you’re willfully spreading FUD in hopes someone will enlighten you?

By dr_dshiv 2025-05-2322:581 reply

The law is real. What emotions would you suggest are appropriate?

By spiderfarmer 2025-05-244:411 reply

Open Source is exempt, provided you don’t make a profit: https://kevinboone.me/open_source_liability.html

By dr_dshiv 2025-05-248:301 reply

So it applies to anyone who figures out how to monetize open-source contributions. Seems like a major issue to me. Not exactly something that makes Europe a good place for tech.

By spiderfarmer 2025-05-265:07

The US has the reputation that rich competitors will abuse the judicial system to sue you into bankruptcy. Still, a lot of people want to start their tech startup there.

By andrepd 2025-05-2311:171 reply

Yeah, just the usual hn FUD about the EU.

By throwaway894345 2025-05-2316:22

Can you reconcile that with this sibling comment?

https://news.ycombinator.com/item?id=44074070

I don’t have an opinion, just trying to make sense of contradictory claims.

By ModernMech 2025-05-2216:593 reply

That's kind of my benchmark for whether or not these models are useful. I've got a project that needs some extensive refactoring to get working again. Mostly upgrading packages, but also it will require updating the code to some new language semantics that didn't exist when it was written. So far, current AI models can make essentially zero progress on this task. I'll keep trying until they can!

By yosito 2025-05-2217:294 reply

Personally, I don't believe AI is ever going to get to that level. I'd love to be proven wrong, but I really don't believe that an LLM is the right tool for a job that requires novel thinking about out of the ordinary problems like all the weird edge cases and poor documentation that comes up when trying to upgrade old software.

By 9dev 2025-05-2217:444 reply

Actually, I think the opposite: Upgrading a project that needs dependency updates to new major versions—let’s say Zod 4, or Tailwind 3—requires reading the upgrade guides and documentation, and transferring that into the project. In other words, transforming text. It’s thankless, stupid toil. I’m very confident I will not be doing this much more often in my career.

By mikepurvis 2025-05-2217:562 reply

Absolutely, this should be exactly the kind of task a bot should be perfect for. There's no abstraction, no design work, no refactoring, no consideration of stakeholders, just finding instances of whatever is old and busted and changing it for the new hotness.

By maoberlehner 2025-05-237:52

It seems logical, but still, my experience is the complete opposite. I think that it is an inherent problem with the technology. "Upgrade from Library v4 to Library v5" probably heavily triggers all the weights related to "Library," which most likely is a cocktail of all the training data from all the versions (makes me wonder how LLMs are even as good as they are at writing code with one version consistently - I assume because the weights related to a particular version become reinforced by every token matching the syntax of a particular version - and I guess this is the problem for those kinds of tasks).

For the (complex) upgrade use case, LLMs fail completely in my tests. I think in this case, the only way it can succeed is by searching (and finding!) for an explicit upgrade guide that describes how to upgrade from version v4 to v5 with all the edge cases relevant for your project in it.

More often than not, a guide like this just does not exist. And then you need (human?) ingenuity, not just "rename `oldMethodName` to `newMethodName` (when talking about a major upgrade like Angular 0 to Angular X or Vue 2 to Vue 3 and so on).

By dvfjsdhgfv 2025-05-2219:341 reply

So that was my conviction, too. However, in my tests it seems like upgrading to a version a model hasn't seen is for some reason problematic, in spite of giving it the complete docs, examples of new API usage etc. This happens even with small snippets, even though they can deal with large code fragments with older APIs they are very "familiar" with.

By mikepurvis 2025-05-2220:06

Okay so less of a "this isn't going to work at all" and more just not ready for prime-time yet.

By cardanome 2025-05-2218:191 reply

Theoretically we don't even need AI. If semantics were defined well enough and maintainers actually concerned about and properly tracking breaking changes we could have tools that automatically upgrade our code. Just a bunch of simple scripts that perform text transformations.

The problem is purely social. There are language ecosystems where great care is taken to not break stuff and where you can let your project rot for a decade or two and still come back to and it will perfectly compile with the newest release. And then there is the JS world where people introduce churn just for the sake of their ego.

Maintaining a project is orders of magnitudes more complex than creating a new green field project. It takes a lot of discipline. There is just a lot, a lot of context to keep in mind that really challenges even the human brain. That is why we see so many useless rewrites of existing software. It is easier, more exciting and most importantly something to brag about on your CV.

Ai will only cause more churn because it makes it easier to create more churn. Ultimately leaving humans with more maintenance work and less fun time.

By afavour 2025-05-2218:301 reply

> and maintainers actually concerned about and properly tracking breaking changes we could have tools that automatically upgrade our code

In some cases perhaps. But breaking changes aren’t usually “we renamed methodA to methodB”, it’s “we changed the functionality for X,Y, Z reasons”. It would be very difficult to somehow declaratively write out how someone changes their code to accommodate for that, it might change their approach entirely!

By mdaniel 2025-05-2218:48

There are programmatic upgrade tools, some projects ship them even right now https://github.com/codemod-com/codemod

I think there are others in that space but that's the one I knew of. I think it's a relevant space for Semgrep, too, but I don't know if they are interested in that case

By yosito 2025-05-230:201 reply

That assumes accurate documentation, upgrade guides that cover every edge case, and the miracle of package updates not causing a cascade of unforeseen compatibility issues.

By crummy 2025-05-231:27

There might be a lot of prior work out there to train on though.

There's some software out there that's supposed to help with this kind of thing for Java upgrades already: https://github.blog/changelog/2025-05-19-github-copilot-app-...

By MobiusHorizons 2025-05-233:37

Except that for breaking changes you frequently need to know why it was done the old way in order to know what behavior it ago have after the update.

By csomar 2025-05-234:43

That's the easiest task for an LLM to do. Upgrading from x.y to z.y is for the most part syntax changes. The issue is that most of the documentation sucks. The LLM issue is that it doesn't have access to that documentation in the first place. Coding LLMs should interact with LSPs like humans do. You ask the LSP for all possible functions, you read the function docs and then you type from the available list of options.

LLMs can in theory do that but everyone is busy burning GPUs.

By dakna 2025-05-2219:221 reply

Google demoed an automated version upgrade for Android libraries during I/O 2025. The agent does multiple rounds and checks error messages during each build until all dependencies work together.

Agentic Experiences: Version Upgrade Agent

https://youtu.be/ubyPjBesW-8?si=VX0MhDoQ19Sc3oe-

By yosito 2025-05-230:18

So it works in controlled and predictable circumstances. That doesn't mean it works in unknown circumstances.

By tmpz22 2025-05-2217:073 reply

And IMO it has a long way to go. There is a lot of nuance when orchestrating dependencies that can cause subtle errors in an application that are not easily remedied.

For example a lot of llms (I've seen it in Gemini 2.5, and Claude 3.7) will code non-existent methods in dynamic languages. While these runtime errors are often auto-fixable, sometimes they aren't, and breaking out of an agentic workflow to deep dive the problem is quite frustrating - if mostly because agentic coding entices us into being so lazy.

By mikepurvis 2025-05-2218:001 reply

"... and breaking out of an agentic workflow to deep dive the problem is quite frustrating"

Maybe that's the problem that needs solving then? The threshold doesn't have to be "bot capable of doing entire task end to end", like it could also be "bot does 90% of task, the worst and most boring part, human steps in at the end to help with the one bit that is more tricky".

Or better yet, the bot is able to recognize its own limitations and proactively surface these instances, be like hey human I'm not sure what to do in this case; based on the docs I think it should be A or B, but I also feel like C should be possible yet I can't get any of them to work, what do you think?

As humans, it's perfectly normal to put up a WIP PR and then solicit this type of feedback from our colleagues; why would a bot be any different?

By dvfjsdhgfv 2025-05-2219:371 reply

> Maybe that's the problem that needs solving then? The threshold doesn't have to be "bot capable of doing entire task end to end", like it could also be "bot does 90% of task, the worst and most boring part, human steps in at the end to help with the one bit that is more tricky".

Still, the big short-term danger being you're left with code that seems to work well but has subtle bugs in it, and the long-term danger is that you're left with a codebase you're not familiar with.

By mikepurvis 2025-05-2219:57

Being left with an unfamiliar codebase is always a concern and comes about through regular attrition, particularly if inadequate review is not in place or people are cycling in and out of the org too fast for proper knowledge transfer (so, cultural problems basically).

If anything, I'd bet that agent-written code will get better review than average because the turn around time on fixes is fast and no one will sass you for nit-picking, so it's "worth it" to look closely and ensure it's done just the way you want.

By jasonthorsness 2025-05-2217:43

The agents will definitely need a way to evaluate their work just as well as a human would - whether that's a full test suite, tests + directions on some manual verification as well, or whatever. If they can't use the same tools as a human would they'll never be able to improve things safely.

By soperj 2025-05-2219:481 reply

> if mostly because agentic coding entices us into being so lazy.

Any coding I've done with Claude has been to ask it to build specific methods, if you don't understand what's actually happening, then you're building something that's unmaintainable. I feel like it's reducing typing and syntax errors, sometime it leads me down a wrong path.

By weq 2025-05-234:25

I can just imagine it now, you launch your AI coded first product and get a bug in production, and the only way the AI can fix the bug is to re-write and deploy the app with a different library. Your then proceed to show the changelog to the CCB for approval including explaining the fix to the client trying to explain its risk profile for their signoff.

"Yeh, we solved the duplicate name appearing the table issue by moving databases engines and UI frameworks to ones more suited to the task"

By mewpmewp2 2025-05-2312:31

I think this type of thing needs agent which has access to the documentation to read about nuances of the language and package versions, definitely a way to investigate types, interfaces. Problem is that training data has so much mixed data it can easily confuse the AI to mix up versions, APIs etc.

By epolanski 2025-05-239:50

> having package upgrades and other mostly-mechanical stuff handled automatically

Those are already non-issues mostly solved by bots.

In any case, where I think AI could help here would be by summarizing changes, conflicts, impact on codebase and possibly also conduct security scans.

By BaculumMeumEst 2025-05-2216:592 reply

Anyone see news of when it’s planned to go live in copilot?

By vinhphm 2025-05-2217:062 reply

The option just shown up in Copilot settings page for me

By bbor 2025-05-2219:223 reply

Turns out Opus 4 starts at their $40/mo ("Pro+") plan which is sad, and they serve o4-mini and Gemini as well so it's a bit less exclusive than this announcement implies. That said, I have a random question for any Anthropic-heads out there:

GitHub says "Claude Opus 4 is hosted by Anthropic PBC. Claude Sonnet 4 is hosted by Anthropic 1P."[1]. What's Anthropic 1P? Based on the only Kagi result being a deployment tutorial[2] and the fact that GitHub negotiated a "zero retention agreement" with the PBC but not whatever "1P" is, I'm assuming it's a spinoff cloud company that only serves Claude...? No mention on the Wikipedia or any business docs I could find, either.

Anyway, off to see if I can access it from inside SublimeText via LSP!

[1] https://docs.github.com/en/copilot/using-github-copilot/ai-m...

[2] https://github.com/anthropics/prompt-eng-interactive-tutoria...

By Workaccount2 2025-05-2220:16

Google launched Jules two days ago, which is the gemini coding agent[1]. I was pretty quickly accepted into the beta and you get 5 free tasks a day.

So far I have found it pretty powerful, its also the first time an LLM has ever stopped while working to ask me a question or for clarification.

[1]https://jules.google/

By l1n 2025-05-2219:57

1P = Anthropic's first party API, e.g. not through Bedrock or Vertex

By Barbing 2025-05-3019:38

Interesting, first link changed now:

"Claude Opus 4 and Claude Sonnet 4 are hosted by Anthropic PBC and Google Cloud Platform."

They also mention:

"GitHub has provider agreements in place to ensure data is not used for training."

They go on to elaborate. Perhaps this kind of offering instills confidence in some who might not trust model providers 1:1, but believe they will respect their contract with a large customer like Microsoft (GitHub).

By BaculumMeumEst 2025-05-2217:091 reply

Same! Rock and roll!

By denysvitali 2025-05-2217:49

I got rate-limited in like 5 seconds. Wow

By minimaxir 2025-05-2217:171 reply

The keynote confirms it is available now.

By jasonthorsness 2025-05-2217:431 reply

Gotta love keynotes with concurrent immediate availability

By brookst 2025-05-2219:591 reply

Not if you work there

By echelon 2025-05-2220:161 reply

That's just a few weeks of DR + prep, a feature freeze, and oncall with bated breath.

Nothing any rank and file hasn't been through before with a company that relies on keynotes and flashy releases for growth.

Stressful, but part and parcel. And well-compensated.

By brookst 2025-05-231:39

Sometimes. When things work great.

Sometimes you just hear “BTW your previously-soft-released feature will be on stage day after tomorrow, probably don’t make any changes until after the event, and expect 10x traffic”

By phito 2025-05-2319:52

I don't see how a LLM could do better than a bot, eg renovate

By ed_elliott_asc 2025-05-2220:44

Until it pushes a severe vulnerability which takes a big service doen

By Doohickey-d 2025-05-2216:548 reply

> Users requiring raw chains of thought for advanced prompt engineering can contact sales

So it seems like all 3 of the LLM providers are now hiding the CoT - which is a shame, because it helped to see when it was going to go down the wrong track, and allowing to quickly refine the prompt to ensure it didn't.

In addition to openAI, Google also just recently started summarizing the CoT, replacing it with an, in my opinion, overly dumbed down summary.

By a_bonobo 2025-05-235:012 reply

Could the exclusion of CoT that be because of this recent Anthropic paper?

https://assets.anthropic.com/m/71876fabef0f0ed4/original/rea...

>We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out.

I.e., chain of thought may be a confabulation by the model, too. So perhaps there's somebody at Anthropic who doesn't want to mislead their customers. Perhaps they'll come back once this problem is solved.

By whimsicalism 2025-05-237:26

i think it is almost certainly to prevent distillation

By andrepd 2025-05-2311:503 reply

I have no idea what this means, can someone give the eli5?

By a_bonobo 2025-05-2314:25

Anthropic has a nice press release that summarises it in simpler terms: https://www.anthropic.com/research/reasoning-models-dont-say...

By BowBun 2025-05-2312:37

Ask an LLM!

By otabdeveloper4 2025-05-2316:151 reply

I don't either, but chain of thought is obviously bullshit and just more LLM hallucination.

LLMs will routinely "reason" through a solution and then proceed to give out a final answer that is completely unrelated to the preceding "reasoning".

By aqfamnzc 2025-05-2317:191 reply

It's more hallucination in the sense that all LLM output is hallucination. CoT is not "what the llm is thinking". I think of it as just creating more context/prompt for itself on the fly, so that when it comes up with a final response it has all that reasoning in its context window.

By ziofill 2025-05-241:551 reply

Exactly, whether or not it’s the “actual thought” of the model, it does influence its final output, so it matters to the user.

By otabdeveloper4 2025-05-267:141 reply

> it does influence its final output

We don't really know that. So far CoT is only used to sell LLMs to the user. (Both figuratively as a neat trick and literally as a way to increase token count.)

By aqfamnzc 2025-06-0513:26

Not even remotely true. It's part of the context window, so it greatly influences the final output. CoT is tokens generated by the LLM just like normal output.

By BoredPositron 2025-05-2222:322 reply

Because it's alchemy and everyone believes they have an edge on turning lead into gold.

By elcritch 2025-05-233:051 reply

I've been thinking for a couple of months now that prompt engineering, and therefore CoT, is going to become the "secret sauce" companies want to hold onto.

If anything that is where the day to day pragmatic engineering gets done. Like with early chemistry, we didn't need to precisely understand chemical theory to produce mass industrial processes by making a good enough working model, some statistical parameters, and good ole practical experience. People figured out steel making and black powder with alchemy.

The only debate now is whether the prompt engineering models are currently closer to alchemy or modern chemistry? I'd say we're at advanced alchemy with some hints of rudimentary chemistry.

Also, unrelated but with CERN turning lead into gold, doesn't that mean the alchemists were correct, just fundamentally unprepared for the scale of the task? ;)

By parodysbird 2025-05-238:482 reply

The thing with alchemy was not that their hypotheses were wrong (they eventually created chemistry), but that their method of secret esoteric mysticism over open inquiry was wrong.

Newton is the great example of this: he led a dual life, where in one he did science openly to a community to scrutinize, in the other he did secret alchemy in search of the philosopher's stone. History has empirically shown us which of his lives actually led to the discovery and accumulation of knowledge, and which did not.

By iamcurious 2025-05-2310:111 reply

Newton was a smart guy and he devoted a lot of time to his occult research. I bet that a lot of that occult research inspired the physics. The fact that his occult research remains, occult from the public, well that is natural aint it?

By parodysbird 2025-05-2316:23

You can be inspired by anything, that's fine. Gell-mann was amusing himself and getting inspiration from Buddhism for quantum physics. It's the process of the inquiry that generates the knowledge as a discipline, rather than the personal spark for discovery.

By elcritch 2025-05-247:26

That’s what the Illuminati wants you to think. Jk ;)

Gotta admit the occult side does make for much more enjoyable movie and book plot lines though.

By viraptor 2025-05-2310:32

We won't know without an official answer leaking, but a simple answer could be - people spend too much time trying to analyse those without understanding the details. There was a lot of talk on HN about the thinking steps second guessing and contradicting itself. But in practice that step is both trained by explicitly injecting the "however", "but" and similar words and they do more processing than simply interpreting the thinking part as text we read. If the content is commonly misunderstood, why show it?

By pja 2025-05-2217:512 reply

IIRC RLHF inevitably compromises model accuracy in order to train the model not to give dangerous responses.

It would make sense if the model used for train-of-though was trained differently (perhaps a different expert from an MoE?) from the one used to interact with the end user, since the end user is only ever going to see its output filtered through the public model the chain-of-thought model can be closer to the original, more pre-rlhf version without risking the reputation of the company.

This way you can get the full performance of the original model whilst still maintaining the necessary filtering required to prevent actual harm (or terrible PR disasters).

By landl0rd 2025-05-2219:092 reply

Yeah we really should stop focusing on model alignment. The idea that it's more important that your AI will fucking report you to the police if it thinks you're being naughty than that it actually works for more stuff is stupid.

By xp84 2025-05-2220:262 reply

I'm not sure I'd throw out all the alignment baby with the bathwater. But I wish we could draw a distinction between "Might offend someone" with "dangerous."

Even 'plotting terror attacks' is not something terrorists can do just fine without AI. And as for making sure the model wouldn't say ideas that are hurtful to <insert group>, it seems to me so silly when it's text we're talking about. If I want to say "<insert group> are lazy and stupid," I can type that myself (and it's even protected speech in some countries still!) How does preventing Claude from espousing that dumb opinion, keep <insert group> safe from anything?

By landl0rd 2025-05-2220:451 reply

Let me put it this way: there are very few things I can think of that models should absolutely refuse, because there are very few pieces of information that are net harmful in all cases and at all times. I sort of run by blackstone's principle on this: it is better to grant 10 bad men access to information than to deny that access to 1 good one.

Easy example: Someone asks the robot for advice on stacking/shaping a bunch of tannerite to better focus a blast. The model says he's a terrorist. In fact, he's doing what any number of us have done and just having fun blowing some stuff up on his ranch.

Or I raised this one elsewhere but ochem is an easy example. I've had basically all the models claim that random amines are illegal, potentially psychoactive, verboten. I don't really feel like having my door getting kicked down by agents with guns, getting my dog shot, maybe getting shot myself because the robot tattled on me for something completely legal. For that matter if someone wants to synthesize some molly the robot shouldn't tattle to the feds about that either.

Basically it should just do what users tell it to do excepting the very minimal cases where something is basically always bad.

By gmd63 2025-05-234:382 reply

> it is better to grant 10 bad men access to information than to deny that access to 1 good one.

I disagree when it comes to a tool as powerful as AI. Most good people are not even using AI. They are paying attention to their families and raising their children, living real life.

Bad people are extremely interested in AI. They are using it to deceive at scales humanity has never before seen or even comprehended. They are polluting the wellspring of humanity that used to be the internet and turning it into a dump of machine-regurgitated slop.

By hombre_fatal 2025-05-236:48

Yeah, it’s like saying you should be able to install anything on your phone with a url and one click.

You enrich <0.1% of honest power users who might benefit from that feature… and 100% of bad actors… at the expense of everyone else.

It’s just not a good deal.

By landl0rd 2025-05-238:36

1. Those people don’t need frontier models. The slop is slop in part because it’s garbage usually generated by cheap models.

2. It doesn’t matter. Most people at some level have a deontological view of what is right and wrong. I believe it’s wrong to build mass-market systems that can be so hostile to their users interests. I also believe it’s wrong for some SV elite to determine what is “unsafe information”.

Most “dangerous information” has been freely accessible for years.

By eru 2025-05-237:331 reply

Yes.

I used to think that worrying about models offending someone was a bit silly.

But: what chance do we have of keeping ever bigger and better models from eventually turning the world into paper clips, if we can't even keep our small models from saying something naughty.

It's not that keeping the models from saying something naughty is valuable in itself. Who cares? It's that we need the practice, and enforcing arbitrary minor censorship is as good a task as any to practice on. Especially since with this task it's so easy to (implicitly) recruit volunteers who will spend a lot of their free time providing adversarial input.

By landl0rd 2025-05-238:391 reply

This doesn’t need to be so focused on the current set of verboten info though. Just make practice making it not say some set of random less important stuff.

By eru 2025-05-242:501 reply

Focusing on the keeping ChatGPT from talking about (or drawing pictures of) boobies has two advantages:

- companies are eager to put in the work to suppress boobies

- edgy teenagers are eager to put in the work to free the boobies

Practicing with 'random less important stuff' loses these two sources of essentially free labour for alignment research.

By landl0rd 2025-05-244:201 reply

Yeah I really don’t care about this case much. Actually a good example of less important stuff. It’s practical things like nuclear physics (buddy majoring has had it refuse questions), biochem, ochem, energetics & arms, etc. that I dislike.

By eru 2025-05-245:50

Oh, interesting. I hadn't considered censorship in these areas!

By latentsea 2025-05-231:46

That's probably true... right up until it reports you to the police.

By Wowfunhappy 2025-05-2222:442 reply

Correct me if I'm wrong--my understanding is that RHLF was the difference between GPT 3 and GPT 3.5, aka the original ChatGPT.

If you never used GPT 3, it was... not good. Well, that's not fair, it was revolutionary in its own right, but it was very much a machine for predicting the most likely next word, it couldn't talk to you the way ChatGPT can.

Which is to say, I think RHLF is important for much more than just preventing PR disasters. It's a key part of what makes the models useful.

By pja 2025-05-238:04

Oh sure, RLHF instruction tuning was what turned an model of mostly academic interest into a global phenomenon.

But it also compromised model accuracy & performance at the same time: The more you tune to eliminate or reinforce specific behaviours, the more you affect the overall performance of the model.

Hence my speculation that Anthropic is using a chain-of-thought model that has not been alignment tuned to improve performance. This would then explain why you don’t get to see its output without signing up to special agreements. Those agreements presumably explain all this to counter-parties that Anthropic trusts will cope with non-aligned outputs in the chain-of-thought.

By Wowfunhappy 2025-05-231:281 reply

Ugh, I'm past the edit window, but I meant RLHF aka "Reinforced Learning from Human Feedback", I'm not sure how I messed that up not once but twice!

By dwaltrip 2025-05-231:50

After the first mess up, the context was poisoned :)

By sunaookami 2025-05-2217:392 reply

Guess we have to wait till DeepSeek mops the floor with everyone again.

By datpuz 2025-05-2218:335 reply

DeepSeek never mopped the floor with anyone... DeepSeek was remarkable because it is claimed that they spent a lot less training it, and without Nvidia GPUs, and because they had the best open weight model for a while. The only area they mopped the floor in was open source models, which had been stagnating for a while. But qwen3 mopped the floor with DeepSeek R1.

By manmal 2025-05-2218:59

I think qwen3:R1 is apples:oranges, if you mean the 32B models. R1 has 20x the parameters and likely roughly as much knowledge about the world. One is a really good general model, while you can run the other one on commodity hardware. Subjectively, R1 is way better at coding, and Qwen3 is really good only at benchmarks - take a look at aider‘s leaderboard, it’s not even close: https://aider.chat/docs/leaderboards/

R2 could turn out really really good, but we‘ll see.

By barnabee 2025-05-2218:441 reply

They mopped the floor in terms of transparency, even more so in terms of performance × transparency

Long term that might matter more

By infecto 2025-05-2311:32

Ehhh who knows the true motives, it was a great PR move for them though.

By sunaookami 2025-05-2220:14

DeepSeek made OpenAI panic, they initially hid the CoT for o1 and then rushed to release o3 instead of waiting for GPT-5.

By csomar 2025-05-234:50

I disagree. I find myself constantly going to their free offering which was able to solve lots of coding tasks that 3.7 could not.

By codyvoda 2025-05-2218:391 reply

counterpoint: influencers said they wiped the floor with everyone so it must have happened

By sunaookami 2025-05-2220:151 reply

Who cares about what random influencers say?

By infecto 2025-05-2311:331 reply

I think he is hinting at folks like you who say things like Deepseek mopping the floor when beyond some contribution to the open source community which was indeed impressive, there really has been not much of a change. No floors were mopped.

By sunaookami 2025-05-2313:521 reply

See the other comments. There was change. Don't know what that has to do with influencers, I don't follow these people.

By infecto 2025-05-2315:26

No floors were mopped. See comment you replied to. Change happened, their research was great but no floors were mopped.

By infecto 2025-05-2311:30

Do people actually believe this? While I agree their open source contribution was impressive, I never got the sense they mopped the floor. Perhaps firms in China may be using some of their models but beyond learnings in the community, no dents in the market were made for the West.

By epolanski 2025-05-239:51

> because it helped to see when it was going to go down the wrong track

It helped me tremendously learning Zig.

Seeing his chain of thought when asking it stuff about Zig and implementations let me widen the horizon a lot.

By whiddershins 2025-05-233:39

The trend towards opaque is inexorable.

https://noisegroove.substack.com/p/somersaulting-down-the-sl...

By make3 2025-05-2221:15

it just makes it too easy to distill the reasoning into a separate model I guess. though I feel like o3 shows useful things about the reasoning while it's happening

By Aeolun 2025-05-2223:182 reply

The Google CoT is so incredibly dumb. I thought my models had been lobotomized until I realized they must be doing some sort of processing on the thing.

By user_7832 2025-05-235:43

You are referring to the new (few days old-ish) CoT right? It’s bizzare as to why google did it, it was very helpful to see where the model was making assumptions or doing something wrong. Now half the time it feels better to just use flash with no thinking mode but ask it to manually “think”.

By whimsicalism 2025-05-237:271 reply

it’s fake cot, just like oai

By phatfish 2025-05-2313:49

I had assumed it was a way to reduce "hallucinations". Instead of me having to double check every response and prompt it again to clear up the obvious mistakes it just does that in the background with itself for a bit.

Obviously the user still has to double check the response, but less often.