Agents that write code, debug themselves, and keep a notebook between runs (the Stratix Cup is worth a look)

Sharing this in case anyone here likes watching agents get stress-tested in public: the Stratix Cup, run by LayerLens (an AI eval company).

The premise is a coding benchmark dressed up as a football tournament. Sixteen frontier models each get a brief and have to write a Python strategy that controls all 11 players on a team. The models never control anything in real time. Each one gets a single coaching session per matchday (30 turns or 180 seconds, whichever runs out first), then submits, and the submitted code plays the match on its own for two 2.5-minute halves.
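For anyone who wants to picture the session budget, here is a rough sketch of a loop capped at 30 turns or 180 seconds. The two limits come from the post; the `take_turn` callback, the return values, and the choice to check the clock before each turn are my own assumptions, not how LayerLens implements it.

```python
import time

MAX_TURNS = 30        # turn budget quoted in the post
MAX_SECONDS = 180.0   # time budget quoted in the post


def run_coaching_session(take_turn, clock=time.monotonic):
    """Run turns until the agent submits or one of the budgets runs out.

    take_turn(turn_index) is a hypothetical callback that returns True
    when the agent decides to submit. Returns (reason, turns_used).
    """
    start = clock()
    for turn in range(MAX_TURNS):
        # Assumption: the time budget is checked before each turn starts.
        if clock() - start >= MAX_SECONDS:
            return "out_of_time", turn
        if take_turn(turn):
            return "submitted", turn + 1
    return "out_of_turns", MAX_TURNS
```

Passing the clock in as a parameter is only there so the loop can be exercised with a fake clock instead of waiting three minutes.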

Most of what it tests is stuff we fight with when building crews, so it’s a useful watch even though football is the surface layer:

The agent works a real tool loop. It can write or edit its policy (edit is find-and-replace on a unique string), typecheck, simulate seeded matches against a reference team, run isolated drills like penalties or defending counters, and run arbitrary Python in a sandbox to probe the engine. Every write and edit fires an inline typecheck, so a model that ships broken code sees the error on the same turn and gets a chance to recover. The interesting part is watching which models use that feedback to fix themselves and which ones just dig the hole deeper.
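The edit-then-check step is easy to sketch. This is a minimal version under my own assumptions: the function names are made up, and Python's built-in `compile` stands in for the real typecheck, so it only catches syntax errors, not type errors.

```python
from typing import Optional, Tuple


def apply_edit(source: str, old: str, new: str) -> str:
    """Find-and-replace that refuses to run unless `old` matches exactly once."""
    count = source.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match for the target string, found {count}")
    return source.replace(old, new, 1)


def check(source: str) -> Optional[str]:
    """Stand-in for the inline typecheck: reports syntax errors only."""
    try:
        compile(source, "<policy>", "exec")
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def edit_policy(source: str, old: str, new: str) -> Tuple[str, Optional[str]]:
    """Apply an edit and return (source, error) so the error lands on the same turn."""
    try:
        updated = apply_edit(source, old, new)
    except ValueError as exc:
        # Ambiguous or missing target: leave the policy untouched.
        return source, str(exc)
    return updated, check(updated)
```

Requiring a unique match means an ambiguous edit fails loudly instead of silently changing the wrong line, which is presumably why the tool is built that way.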

There’s persistent memory across runs. Each model has a private notebook it carries from one matchday to the next, so you can see what it decided was worth remembering and whether it ever cashed that in later. Feels very relevant to how we think about memory in long-running crews.
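If you want to try the notebook idea in your own setup, here is a minimal sketch that assumes a JSON file per model. The file layout and function names are mine; the post does not say how the Cup stores notebooks.

```python
import json
from pathlib import Path


def load_notebook(path: Path) -> list:
    """Return the notes saved so far, or an empty list on the first matchday."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def append_note(path: Path, matchday: int, note: str) -> None:
    """Add one note and write the whole notebook back to disk."""
    notes = load_notebook(path)
    notes.append({"matchday": matchday, "note": note})
    path.write_text(json.dumps(notes, indent=2), encoding="utf-8")
```

Tagging each note with its matchday is what lets you check later whether something written early ever shows up in a decision.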

It’s fully traced. Every coaching turn logs tool calls, latency, cost, and reasoning, published per model per matchday. Basically the observability layer you wish you had when an agent does something dumb in prod and you’re trying to reconstruct why.
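A per-turn trace record along these lines is cheap to add to any agent loop. The fields mirror what the post lists (tool calls, latency, cost, reasoning); the class, field names, and tool names in the usage example are hypothetical, not the published trace schema.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass
class TurnTrace:
    model: str
    matchday: int
    turn: int
    tool: str          # which tool the agent called this turn
    latency_s: float   # wall-clock time for the turn, in seconds
    cost_usd: float    # cost of the turn
    reasoning: str     # the model's stated reasoning


def to_jsonl(traces) -> str:
    """Serialize traces as one JSON object per line."""
    return "\n".join(json.dumps(asdict(t)) for t in traces)


def summarize(traces) -> dict:
    """Roll one session's traces up into totals and per-tool call counts."""
    return {
        "turns": len(traces),
        "total_cost_usd": sum(t.cost_usd for t in traces),
        "total_latency_s": sum(t.latency_s for t in traces),
        "tool_calls": dict(Counter(t.tool for t in traces)),
    }
```

For example, `summarize([TurnTrace("model-a", 1, 1, "write_policy", 0.5, 0.25, "first draft")])` gives you the turn count, totals, and a tool histogram for that one session.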

One methodology bit worth stealing regardless of the lineup: the engine is deterministic per seed, but a single seed can swing a score wildly, so every matchup runs multiple seeds and aggregates. Good reminder not to judge an agent off one run, which is a trap that’s easy to fall into when evaluating crews too.
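The multi-seed idea in miniature, as a sketch: `simulate_match` below is a toy stand-in I made up (random scores from a seeded generator), not the Stratix engine, and the aggregate fields are my own choice. The point is only the shape: same seed, same result; many seeds, one summary.

```python
import random
from statistics import mean


def simulate_match(seed: int):
    """Toy stand-in for the engine: the same seed always gives the same score."""
    rng = random.Random(seed)
    return rng.randint(0, 4), rng.randint(0, 4)


def aggregate(seeds) -> dict:
    """Play one matchup across several seeds and summarize the results."""
    results = [simulate_match(seed) for seed in seeds]
    diffs = [goals_for - goals_against for goals_for, goals_against in results]
    return {
        "matches": len(results),
        "wins": sum(1 for d in diffs if d > 0),
        "draws": sum(1 for d in diffs if d == 0),
        "losses": sum(1 for d in diffs if d < 0),
        "mean_goal_diff": mean(diffs),
    }
```

Swapping `simulate_match` for a call into your own agent eval gives you the same habit: fix the seeds, run them all, and compare aggregates rather than single runs.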

Groups, traces, and results are here: Stratix Cup
And the mechanics are written up here if you want the SDK-level detail: Stratix Cup

Curious whether anyone here reads the traces the same way I do, especially the gap between models that recover from a bad submission and ones that compound it.