MrCheeze

2026-02-19 5:45

Commented: "Gemini 3.1 Pro"

Does anyone understand why LLMs have gotten so good at this? Their ability to generate accurate SVG shapes seems to greatly outshine what I would expect, given their mediocre spatial understanding in other contexts.

2026-02-18 1:39

Commented: "Claude Sonnet 4.6"

The Claude Plays Pokemon stream with a minimal harness is a far more significant test of model intelligence than the Gemini Plays Pokemon stream (which automatically maintains a map of everything that has been seen on the current map) and the GPT Plays Pokemon stream (which does that AND has an extremely detailed prompt which more or less railroads the AI into not making the mistakes it wants to make). The latter two harnesses have become too easy for the latest generations of models, enough so that they're not really testing anything anymore.

Claude Plays Pokemon is currently stuck in Victory Road, doing the Sokoban puzzles, which are both the last puzzles in the game and by far the most difficult for AIs to do. Opus 4.5 made it there but was completely hopeless. 4.6 made it there and is showing some signs of maaaaaybe being able to eventually brute-force its way through the puzzles, but personally I think it will get stuck or undo its progress, and that Claude 4.7 or 5 will be the one to actually beat the game.

2026-02-17 11:04

Commented: "Claude Sonnet 4.6"

Notably 45 out of the 50 days of improvement were in two specific dungeons (Silph Co and Cinnabar Mansion) where 4.5 was entirely inadequate and was looping the same mistaken ideas with only minor variation, until eventually it stumbled by chance into the solution. Until we saw how much better it did in those spots, we weren't completely sure that 4.6 was an improvement at all!

https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQDvsy5D...

2026-02-17 8:06

Commented: "Claude Sonnet 4.6"

In my experience with the models (watching Claude play Pokemon), they are similar in intelligence, but very different in how they approach problems: Opus 4.5 hyperfocuses on completing its original plan, far more than any older or newer version of Claude. Opus 4.6 gets bored quickly and is constantly changing its approach if it doesn't get results fast. This makes it waste more time on "easy" tasks where the first approach would have worked, but makes it faster by an order of magnitude on "hard" tasks that require trying different approaches. For this reason, it started off slower than 4.5, but ultimately got as far in 9 days as 4.5 got in 59 days.

2025-12-21 8:38

Commented: "Gemini 3 Pro vs. 2.5 Pro in Pokemon Crystal"

This writeup on the underground puzzle is worth reading; it's a pretty baffling "puzzle" design. https://pokemow.com/Gen2/ShutterPuzzle/

That said, it's definitely Gem's fault that it struggled so long, considering it ignored the NPCs that give clues.

Hacker News

MrCheeze

111

2023-07-15
