I've read the paper before I made the statement, and I still made the statement because there are issues with it.

The first problem is that the way Anthropic trains their models, and the architecture of those models, differ from most of the open-source models people use. They are still transformer-based, but they are not structurally put together the same as most models, so you can't extrapolate findings on their models to other models. Their training methods also use a lot more regularization of the data, trying to weed out targeted biases as much as possible, meaning the models are trained on more synthetic data that normalizes the data as much as possible across languages, tone, etc. The same goes for the system prompt: Anthropic treats it differently, whereas open-source models just prepend the system prompt in front of the user's query internally (see the sketch below). The attention is applied differently, among other things.

Second, the way their models "internalize" the world is vastly different from what humans would think of as "building a world model" of reality. It's hard to put into words, but basically their models do have an underlying representative structure; it's just not anything that would be of use in the domains humans care about, "true reasoning". Grokking the concept, if you will.

Honestly, I highly suggest folks take a lot of what Anthropic publishes with a grain of salt. I feel that a lot of the information they present is purposely misinterpreted by their teams for media or PR/clout or who knows what reasons. But the biggest reason is the one I stated at the beginning: most models are not of the same ilk as Anthropic's. I would suggest folks focus on reading interpretability research on open-source models, since those are the ones corporations are most likely to use for their cheap API costs, and those models have nowhere near the care and sophistication put into them that Anthropic's do.
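To make the system-prompt point concrete: in most open-weight chat models there is no separate channel for the system prompt at all; a chat template just splices it into the same token stream as the user's message. Here's a minimal sketch using the Llama-2-chat template as an example (other open models use ChatML-style markers instead, and in practice you'd let the tokenizer's `apply_chat_template` build this string for you):

```python
# Sketch: how many open-weight models receive the "system prompt".
# There is no privileged channel -- the template literally prepends it
# to the user's message inside the same instruction block
# (Llama-2-chat format shown; ChatML models do the equivalent with
# <|im_start|>system ... <|im_end|> markers).

def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

print(build_llama2_prompt(
    "You are a terse assistant.",
    "What is the capital of France?",
))
```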
I've done experiments, and basically what I found is that LLMs are extremely sensitive to... language. Well, duh, but let me explain a bit. They will give a different quality/accuracy of answer depending on the system prompt order, language use, length, how detailed the examples are, etc. Basically every variable you can think of either improves the output or degrades it. And it makes sense once you really grok that LLMs "reason and think" in tokens. They have no internal world representation; tokens are the raw layer on which they operate. For example, if you ask a bilingual human what their favorite color is, the answer will be that color regardless of which language they answer in. For an LLM, that answer might change depending on the language used, because it's all the statistical distribution of tokens in training that conditions the response.

Anyway, I don't want to make a long post here. The good news out of this is that once you have found the best way of asking questions of your model, you can consistently get accurate responses; the trick is to find the best way to communicate with that particular LLM. That's why I'm hard at work on an auto-calibration system that runs through a barrage of candidate system prompts and other hyperparameters to find the best ones for a specific LLM. The process can be fully automated, you just need to set it all up.
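For what it's worth, the calibration loop itself is simple; the real work is in building a representative eval set and a good scoring function. Here's a minimal sketch of the idea, where `query_model` is a hypothetical stand-in you'd wire to your own local inference endpoint, and scoring is naive exact-match on a toy eval set:

```python
import itertools

# Hypothetical stand-in for your inference call (llama.cpp server,
# vLLM, an HTTP client, etc.). Replace with a real backend; the fixed
# return value is just so the sketch runs end to end.
def query_model(system_prompt: str, question: str, temperature: float) -> str:
    return "Paris"

# Toy eval set: (question, expected answer). A real one should cover
# the task distribution you actually care about.
EVAL_SET = [
    ("Capital of France?", "Paris"),
    ("Capital of Japan?", "Tokyo"),
]

SYSTEM_PROMPTS = [
    "Answer with a single word.",
    "You are a precise assistant. Reply with only the answer, no preamble.",
]
TEMPERATURES = [0.0, 0.7]

def score(system_prompt: str, temperature: float) -> float:
    """Fraction of eval questions answered exactly right."""
    hits = sum(
        query_model(system_prompt, q, temperature).strip().lower() == a.lower()
        for q, a in EVAL_SET
    )
    return hits / len(EVAL_SET)

# Exhaustive sweep over the prompt/parameter grid; for a large grid
# you'd swap in random search or something smarter, but the shape of
# the calibration loop stays the same.
best = max(
    itertools.product(SYSTEM_PROMPTS, TEMPERATURES),
    key=lambda cfg: score(*cfg),
)
print("best config:", best, "accuracy:", score(*best))
```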
Understanding that no outside provider is going to care about your privacy, and will always choose to sell you their crappy advertisements and push their agenda on you, is the first step in building a solution. In my opinion that solution will come in the form of a personalized, local AI agent that is the gatekeeper of all information the user sends to and receives from the outside world: a fully context-aware agent that has the user's interests in mind, and so only provides user-approved context to other AI systems, while also filtering everything coming to the user for spam, agenda manipulation, etc. Basically a very advanced spam blocker of the future that is 100% local and fully user-controlled and user-calibrated. I think we should all be working on something like this if we want to keep our sanity in this brave new world.
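The skeleton of such a gatekeeper is just two filters sitting between the user and the outside world: an outbound one that only releases context the user has explicitly approved, and an inbound one that drops manipulation/spam. A toy sketch of that shape; the field names and regex heuristics are purely illustrative, and a real version would run a local classifier model rather than keyword rules:

```python
import re

# User-approved context: only these keys may ever leave the machine.
OUTBOUND_ALLOWLIST = {"city", "preferred_language"}

# Toy inbound heuristics; a real gatekeeper would use a locally-run
# LLM classifier here instead of keyword regexes.
SPAM_PATTERNS = [
    re.compile(r"limited[- ]time offer", re.I),
    re.compile(r"act now", re.I),
]

def filter_outbound(user_context: dict) -> dict:
    """Release only user-approved fields to an external AI system."""
    return {k: v for k, v in user_context.items() if k in OUTBOUND_ALLOWLIST}

def filter_inbound(message: str) -> str | None:
    """Drop messages that trip the spam/manipulation heuristics."""
    if any(p.search(message) for p in SPAM_PATTERNS):
        return None
    return message

context = {"city": "Berlin", "preferred_language": "en", "health_notes": "private"}
print(filter_outbound(context))                         # health_notes never leaves
print(filter_inbound("Act now! Limited-time offer!"))   # None (blocked)
print(filter_inbound("Your meeting moved to 3pm."))     # passes through
```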