I did this stuff: https://mrcheeze.github.io/
The Claude Plays Pokemon stream, with its minimal harness, is a far more significant test of model intelligence than the Gemini Plays Pokemon stream (which automatically maintains a map of everything that has been seen on the current map) and the GPT Plays Pokemon stream (which does that AND has an extremely detailed prompt which more or less railroads the AI into not making the mistakes it wants to make). The latter two harnesses have become too easy for the latest generations of models, enough so that they're not really testing anything anymore.
Claude Plays Pokemon is currently stuck in Victory Road, doing the Sokoban puzzles, which are both the last puzzles in the game and by far the most difficult for AIs to do. Opus 4.5 made it there but was completely hopeless; 4.6 made it there and is showing some signs of maaaaaybe being able to eventually brute-force its way through the puzzles, but personally I think it will get stuck or undo its progress, and that Claude 4.7 or 5 will be the one to actually beat the game.
Notably 45 out of the 50 days of improvement were in two specific dungeons (Silph Co and Cinnabar Mansion) where 4.5 was entirely inadequate and was looping the same mistaken ideas with only minor variation, until eventually it stumbled by chance into the solution. Until we saw how much better it did in those spots, we weren't completely sure that 4.6 was an improvement at all!
https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQDvsy5D...
In my experience with the models (watching Claude play Pokemon), the models are similar in intelligence but very different in how they approach problems: Opus 4.5 hyperfocuses on completing its original plan, far more than any older or newer version of Claude. Opus 4.6 gets bored quickly and is constantly changing its approach if it doesn't get results fast. This makes it waste more time on "easy" tasks where the first approach would have worked, but makes it faster by an order of magnitude on "hard" tasks that require trying different approaches. For this reason, it started off slower than 4.5, but ultimately got as far in 9 days as 4.5 got in 59 days.
This writeup on the underground puzzle is worth reading; it's a pretty baffling "puzzle" design. https://pokemow.com/Gen2/ShutterPuzzle/
That said, it's definitely Gem's fault that it struggled so long, considering it ignored the NPCs that give clues.