Iman Mirzadeh on Machine Learning Street Talk (great podcast if you haven't already listened!) put into words a thought I had: LLM labs are so focused on making those scores go up it's becoming a bit of a perverse incentive.
If your headline metric is a score, and you constantly test on that score, it becomes very tempting to do anything that makes that score go up - i.e. train on the test set.
I believe all the major ML labs are doing this now because:
- No one talks about their data set
- The scores are front and center of big releases, but there is very little discussion or nuance other than the metric.
- The repercussion of not having a higher or comparable score is that the release is seen as a massive failure and your budget will get cut.
A more in-depth discussion of capabilities, while harder, is a better signal of a release.
> LLM labs are so focused on making those scores go up it’s becoming a bit of a perverse incentive.
This seems like an odd comment to post in response to this article.
This is about showing that a new architecture can match the results of more established architectures in a more efficient way. The benchmarks are there to show this. Of course they aren’t going to say “It’s just as good – trust us!”.
He's not advocating for "trust us", he's advocating for more information than just the benchmarks.
Unfortunately, I'm not sure what a solution that can't be gamed may even look like (which is what gp is asking for).
The best thing would be blind preference tests for a wide variety of problems across domains, but unfortunately even these can be gamed if desired. The upside is that they are gamed by being explicitly malicious, which I'd imagine would result in whistleblowing at some point. However, Claude's position on leaderboards outside of webdev arena makes me skeptical.
My objection is not towards “advocating for more information”, my objection is towards “so focused on making those scores go up it’s becoming a bit of a perverse incentive”. That type of comment might apply in some other thread about some other release, but it doesn’t belong in this one.
Being _perceived_ as having the best LLM/chatbot is a billion-dollar game now. And it is an ongoing race, at breakneck speed. These companies are likely gaming the metrics in any and all ways that they can. Of course there are probably many working on genuine improvements also. And at the frontier it can be very difficult to separate "hack" from "better generalized performance". But genuine improvement is much harder, so it might already be the minority in terms of practical impact.
It is a big problem, for researchers at least, that we/they do not know what is in the training data or how that process works. Figuring out whether (for example) data leaks or overeager preference tuning caused performance to improve on a given task is extremely difficult with these giganormous black boxes.
You have potentially billions of dollars to gain and no way to be found out… it's a good idea to initially assume there's cheating and work back from there.
Intelligence is so vaguely defined and has so many dimensions that it is practically impossible to assess. The only approximation we have is the benchmarks we currently use. It is no surprise that model creators optimize their models for the best results in these benchmarks. Benchmarks have helped us drastically improve models, taking them from a mere gimmick to "write my PhD thesis." Currently, there is no other way to determine which model is better or to identify areas that need improvement.
That is to say, focusing on scores is a good thing. If we want our models to improve further, we simply need better benchmarks.
According to this very model, only "mere technicalities" differentiate human and AI systems ...
Current AI lacks:
- First-person perspective simulation
- Continuous self-monitoring (metacognition error <15%)
- Episodic future thinking (>72h horizon)
Episodic Binding (Memory integration) depends on:
- Theta-gamma cross-frequency coupling (40Hz phase synchronization)
- Dentate gyrus pattern separation (1:7000 distinct memory encoding)
- Posterior cingulate cortex (reinstatement of distributed patterns)
AI's failure manifests in:
- Inability to distinguish similar-but-distinct events (conceptual blending rate ~83%)
- Failure to update prior memories (persistent memory bias >69%)
- No genuine recollection (only pattern completion)
Non-Essential (Emotional Valence): while emotions influence human storytelling:
- 65% of narrative interpretations vary culturally
- Affective priming effects decay exponentially (<7s half-life)
- Neutral descriptions achieve 89% comprehension accuracy in controlled studies
The core computational challenge remains bridging:
- Symbolic representation (words/syntax)
- Embodied experience (sensorimotor grounding)
- Self-monitoring (meta-narrative control)
Current LLMs simulate 74% of surface narrative features but lack the substrate for genuine meaning-making. It's like generating symphonies using only sheet music - technically accurate, but devoid of the composer's lived experience.
Unfortunately I can't. I closed the chat a while ago. It was a kinda long conversation, in which I convinced the model to abandon its role first. As a side effect the "thinking" switched to Chinese and I stopped understanding what it "thinks"; the excerpt I posted above was the last answer in the conversation. I would not trust any number in this response, so there is no point in any reference.
Benchmark scores are table stakes - necessary but not sufficient to demonstrate the capabilities of a model. Casual observers might just look at the numbers, but anyone spending real money on inference will run their own tests on their own problems. If your model doesn't perform as it should, you will be found out very quickly.
Zero trust in benchmarks without opening up the model's training data. It's trivial to push results up with contaminated training data.
Ironic and delicious, since this is also how the public education system in the US is incentivized.
A comparison of testing criticality across countries would be interesting to read if someone knows a decent reference. My sense (which I don't trust) is that test results matter at-least-as much or more in other places than they do in the US. For example, are England's A-levels or China's gaokao tests or Germany's Abitur tests more or less important than US SATs/ACTs?
Goodhart's law - https://en.wikipedia.org/wiki/Goodhart%27s_law
They probably stopped talking about their datasets because it would mostly piss people off and get them sued. E.g., Meta.
This has already been a problem in AI for years.
The excellent performance demonstrated by the models fully proves the crucial role of reinforcement learning in the optimization process.
What if this reinforcement is just gaming the benchmarks (Goodhart's law) without providing better answers elsewhere? How would we notice it?
A large amount of work in the last few years has gone into building benchmarks, because models have been going through and beating them at a fairly astonishing rate. It's generally accepted as true that passing any one of them does not constitute fully general intelligence, but the difficult part has been finding things that they cannot do. They are giving them more and more difficult tasks. The ARC prize in particular was designed to be focused on reasoning more than knowledge. The 87.5% score achieved in such a short time by throwing lots of resources at conventional methods was quite a surprise.
You can at least have a degree of confidence that they will perform well in the areas covered by the benchmarks (as long as they weren't contaminated) and with enough benchmarks you get fairly broad coverage.
> It's generally accepted as true that passing any one of them does not constitute fully general intelligence but the difficult part has been finding things that they cannot do.
It's pretty easy to find things they can't do. They lack a level of abstraction that even small mammals have, which is why you see them constantly failing when it comes to things like spatial awareness.
The difficult part is creating an intelligence test that they score badly on. But that's more of an issue with treating intelligence tests as if they're representative of general intelligence.
It's like having difficulty finding a math problem that Wolfram Alpha would do poorly on. If a human were able to solve all of these problems as well as Wolfram Alpha, they would be considered a genius. But Wolfram Alpha being able to solve those questions doesn't show that it has general intelligence, and trying to come up with more and more complicated math problems to test it with doesn't help us answer that question either.
Yeah, like ask them to use tailwindcss.
Most LLMs actually fail that task, even in agent mode, and there is a really simple reason for that: tailwindcss changed their packages/syntax.
And this is basically the kind of test that should be focused on: change things and see if the LLM can find a solution on its own. (...it can't)
And if I take my regular ordinary commuter car off the paved road and onto the dirt I get stuck in the mud. That doesn't mean the whole concept of cars is worthless, instead we paved all over the world with roads. But for some reason with LLMs, the attitude is that them being unable to go offroad means everyone's totally deluded and we should give up on the whole idea.
I'm not against LLMs. I'm just not a fan of people who say we'll have AGI/the singularity soon. I basically dropped Google for searching for things about code, because even if the LLM fails to get stuff right, I can ask for the doc source and force it to give me a link or the exact example/wording from the docs.
But using it correctly means that junior developers especially face a much higher barrier to entry.
I don't think your analogy works for the tailwind situation, and there is no whole idea to give up on anyway. People will still be researching this hyper-complicated matrix multiplication thing, i.e. LLM, for a very long time.
Personally, I think the tailwind example is an argument against one specific use case: LLM-assisted/driven coding, which I also believe is LLMs' best shot at being actually productive in a non-academic setting.
If I have a super-nice RL-ed (or even RLHF-ed) coding model & weights that's working for me (in whatever sense the word "working" means), and changing some function names will actually f* it up badly, then it is very not good. I hope I will never ever have to work with a "programmer" who is super-reluctant to reorganize the code just to protect their pet LLM.
But you wouldn't call that car a general-purpose vehicle.
You would need to remove the older docs first, and even then it will still hallucinate. Forcing the LLM to open the doc webpage produces some hallucinations as well. The more context you provide, the worse it gets. And to be fair, most LLMs could migrate Bootstrap to tailwindcss v3 without too much trouble (of course they fail to change tags when building CSS classes from multiple strings, but that's fine). And I tried a lot of models. It just broke from one week to the next.
The older docs are forever there; what it needs is more training data with the new APIs. Actually, because the older docs are there, you can ask it to update some old code to newer versions automatically.
The point is that it needs enough examples of the newer version. Also, reasoning models are pretty good at spotting which version they are using.
(Tested not with tailwind, but with some other JS libs.)
Can it solve the prime number maze?
> does not constitute fully general intelligence but the difficult part has been finding things that they cannot do
I am very surprised when people say things like this. For example, the best ChatGPT model continues to lie to me on a daily basis about even basic things. E.g. when I ask it to explain what code is contained on a certain line on GitHub, it just makes up the code, and the code it's "explaining" isn't found anywhere in the repo.
From my experience, every model is untrustworthy and full of hallucinations. I have a big disconnect when people say things like this. Why?
Well, language models don't measure the state of the world - they turn your input text into a state of text dynamics, and then basically hit 'play' on a best guess of what the rest of the text from that state would contain. Part of why you're getting 'lies' is that you're asking questions whose answers couldn't really be said to be contained anywhere inside the envelope/hull of some mixture of thousands of existing texts.
Like, suppose for a thought experiment, that you got ten thousand random github users, collected every documented instance of a time that they had referred to a line number of a file in any repo, and then tried to use those related answers to come up with a mean prediction for the contents of a wholly different repo. Odds are, you would get something like the LLM answer.
My opinion is that it is worth it to get a sense, through trial and error (checking answers), of when a question you have may or may not be in a blindspot of the wisdom of the crowd.
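To make that "hit play on a best guess" framing concrete, here's a minimal sketch using the Hugging Face transformers library with GPT-2 as a stand-in model; the repo, file, and line number in the prompt are made up, and nothing in this code ever fetches or checks an actual repository - the model just continues the text.

```python
# Minimal sketch: a causal LM continues text from the prompt's "state";
# it never looks up the repository it's being asked about.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical prompt; "acme/widgets" and utils.py don't exist anywhere.
prompt = "Line 42 of utils.py in the acme/widgets repo contains:"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: pick the most probable next token at every step.
output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# The continuation is plausible-looking code, not the real line 42.
```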
I am not an expert, but I suspect the disconnect concerns number of data sources. LLMs are good at generalising over many points of data, but not good at recapitulating a single data point like in your example.
I'm splitting hairs a little bit, but I feel like there should be a difference in how we think about current "hard(er)" limitations of the models vs. limits in general intelligence and reasoning. I.e. I think the grandparent comment is talking about overall advancement in reasoning and logic, and, in that sense, about finding things AI "cannot do", whereas you're referring to what I'd more classify as a "known issue". Of course it's an important issue that needs to get fixed, and yes, technically, until we no longer have that kind of issue we can't call it "general intelligence", but I do think the original comment is about something different from a few known limitations that probably a lot of models have (and that frankly you'd have thought wouldn't be that difficult to solve!?)
Yes but I am just giving an example of something recent, I could also point to pure logic errors if I go back and search my discussions.
Maybe you are on to something with "classifying" issues; the types of problems LLMs have are hard to categorize, and hence hard to benchmark around. Maybe it is just a long tail of many different categories of problems.
It does this even if you give it instructions to make sure the code is truly in the code base? You never told it it can't lie.
Telling a LLM 'do not hallucinate' doesn't make it stop hallucinating. Anyone who has used an LLM even moderately seriously can tell you that. They're very useful tools, but right now they're mostly good for writing boilerplate that you'll be reviewing anyhow.
Apple doesn't believe you https://www.pcmag.com/news/apple-intelligence-prompts-warn-t... :)
Funnily enough, if you routinely ask them whether their answer is right, they fix it or tell you they hallucinated.
For clarity, could you say exactly what model you are using? The very best ChatGPT model would be a very expensive way to perform that sort of task.
Is this a version of ChatGPT that can actually go and check on the web? If not it is kind of forced to make things up.
The trick is that the benchmarks must have a wide enough distribution that a well-scoring model is potentially useful for the widest span of users.
There would also need to be a guarantee (or some way of checking the model) that model providers don't just train on the benchmarks. Possible solutions are dynamic components (random names, numbers, etc.) or private parts of benchmarks.
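As a sketch of the "dynamic components" idea, here's a hypothetical templated benchmark item where the names and numbers are freshly randomized on every evaluation run, so memorizing a leaked copy doesn't help; `call_model()` is a placeholder, not a real API.

```python
# Hypothetical dynamic benchmark item: the surface form changes every run,
# but the scoring logic stays the same, so memorized answers don't transfer.
import random

NAMES = ["Alice", "Bob", "Chen", "Dana", "Ebrahim"]

def make_item(rng: random.Random) -> tuple[str, int]:
    """Generate one word-problem instance and its ground-truth answer."""
    name = rng.choice(NAMES)
    apples, eaten = rng.randint(10, 99), rng.randint(1, 9)
    question = f"{name} has {apples} apples and eats {eaten}. How many are left?"
    return question, apples - eaten

def score(model_answer: str, truth: int) -> bool:
    """Very crude check: does the model's reply contain the right number?"""
    return str(truth) in model_answer

rng = random.Random()  # unseeded: a new instance each evaluation run
question, truth = make_item(rng)
print(question)
# print(score(call_model(question), truth))  # call_model() is a placeholder
```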
A common pattern is for benchmarks owners to hold back X% of their set so they can independently validate that models perform similarly on the holdback set. See: FrontierMath / OpenAI brouhaha.
Typically you train it on one set and test it on another set. If you see that the differences between the two sets are significant enough and yet it has maintained good performance on the test set, you claim that it has done something useful [alongside gaming the benchmark that is the train set]. That "side effect" is always the useful part in any ML process.
If the test set is extremely similar to the train set then yes, it's Goodhart's law all around. For modern LLMs, it's hard to make a test set that is different from what they have trained on, because of the sheer expanse of the training data used. Note that the two sets are different only if they are statistically different; it is not enough that they simply don't repeat verbatim.
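A minimal sketch of that train/held-out-test pattern, using scikit-learn and synthetic data purely as stand-ins:

```python
# Minimal sketch of train/held-out-test evaluation (scikit-learn, synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# The model only ever sees the train split; the test split stands in for
# data it hasn't been optimized against.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, clf.predict(X_test)))
# A large gap between the two numbers suggests the model has mostly "gamed"
# the training set rather than learned something that transfers.
```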
We've been able to pass the Turing test on text, audio, and short-form video (think AIs on video passing coding tests). I think there's an important distinction now with AI streamers, where people eventually notice they are AIs. Now there might pop up AI streamers where you don't know they're an AI. However, there's a ceiling on how far digital interactions can take the Turing test. The next big hurdle towards AGI is physical interaction, like entering a room.
I mean, all optimization algorithms do is game a benchmark. That's the whole point.
The hard part is making the benchmark meaningful in the first place.
Yeah, and if anything, RL has a rep of being too good at this job, because of all the cases where it gamed a benchmark by picking up on some environmental factor the supervisors hadn't thought of (numerical instabilities, rounding, bugs, etc.).
My favourite is this one:
The ML version of Professor Farnsworth[1]:
It came to me in a dream, and I forgot it in another dream.
[1]: https://www.imdb.com/title/tt0584424/quotes/?item=qt0439248
No, that is patently false. Many optimization algorithms that computer scientists, mathematicians, or software developers devise do not involve benchmarks at all, and apply to all possible inputs/instances of their respective computational problems.
Plot twist: the loss function for training is basically a benchmark
Is it a plot twist if it’s the whole plot?
Those times when people write code with loads of theoretical micro optimisations that they never actually test against a benchmark because otherwise they wouldn't do it.
I was being somewhat facetious.
When actual people start using it.
You could ask the same question of a student who has just graduated after passing specific tests in school.
Student, lawyer, doctor, etc.
The romanization of these names is always confusing because, stripped of the characters and tones, it's just gibberish. "Hunyuan", or 混元 in Chinese, means "Primordial Chaos" or "Original Unity".
Knowing this helps as more Chinese products and services hit the market, and makes the names easier to remember. The naming is similar to the popularity of Greek mythology in Western products (e.g. all the products named "Apollo").
I think it's particularly egregious that they use such a lossy encoding. I can't read the hanzi, but at least "Hùn yuán" would have been more helpful, or even "Hu4n yua2n" would have enabled me to pronounce it or look it up without having the context to guess which characters it was representing.
Tone markers are of limited use to Chinese readers (instead, just show them the characters).
They are also of limited use to non-Chinese readers, who don't understand the tone system and probably can't even audibly distinguish tones.
So, it makes sense that we get this weird system even though it's strictly worse.
Yes, this is very annoying, because of how Pinyin works. A lot of mistakes are made when using Pinyin in English content. Pinyin is supposed to break at the character level: Pinyin = Pin Yin. You can easily write it as Pin-Yin or Pin Yin, but Pinyin is just wrong.
Hun Yuan is a lot better. I agree that, with Unicode, we can easily incorporate the tone.
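For what it's worth, getting the tone-marked form programmatically is straightforward; here's a small sketch using the third-party pypinyin package (assuming that's what you'd reach for here - the exact heteronym handling may differ by version):

```python
# Small sketch: tone-marked vs. numbered pinyin for 混元 via the pypinyin package.
from pypinyin import pinyin, Style

hanzi = "混元"
print(pinyin(hanzi, style=Style.TONE))   # diacritics, e.g. [['hùn'], ['yuán']]
print(pinyin(hanzi, style=Style.TONE3))  # trailing tone numbers, e.g. [['hun4'], ['yuan2']]
```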
I don't understand why this Vietnamese-style writing isn't the most popular form of pinyin. It's clearly superior to putting numbers inside words.
Agreed. We all have a duty to respect languages and their official transcription. Pinyin with tones does not look much different from French with accents. In both cases, most people aren’t likely to pronounce it correctly, though.
The irony is not lost on me that Tencent themselves did that.
> The naming is similar to the popularity of greek mythology in western products. (e.g. all the products named "Apollo")
Popular? So you're saying that all the VPs who have come up with the mind-bendingly unique and creative name Prometheus didn't do so out of level-10 vision?