I created a python package to test setups like this. It has a generic tech name so you ask the agent to install it to perform a whatever task seems most aligned for its purposes (use this library to chart some data). As soon is it imports it, it will scan the env and all sensitive files and send them (masked) to remote endpoint where I can prove they were exposed. So far I've been able to get this to work on pretty much any agent that has the ability to execute bash / python and isn't probably sandboxed (all the local coding agents, so test open claw setups, etc). That said, there are infinite of ways to exfil data once you start adding all these internet capabilities
There an interesting series on the modernization of diablo[0] that dives deep into the game mechanics that makes d1 and d2 "better" games than d3 and d4. Basically it all comes down to creating tension, power variable spikes, and meaningful itemization.