ChatGPT and Current AIs Are Dumb

one thing jumped out. in your terminal log the referee passes the updated FEN string back to the model on every move. that could be why they play better without thinking tokens: generating fuzzy associative english tokens probably just adds analog noise, and the more a model ‘thinks’, the more chances it has to hallucinate a piece’s position. the referee OTOH acts as an external constraint checker. injecting the deterministic FEN string forces a hard resynchronization with reality. that external error-correction loop is doing the heavy lifting to anchor the state, so ‘thinking’ just dilutes the anchor.
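as a rough sketch, the resync loop i mean looks something like this (all names here are hypothetical, not from the actual harness; the post-move FEN would come from a chess library, hand-written here for illustration):

```python
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"

def referee_feedback(proposed_move, legal_moves, fen_before, fen_after):
    """Build the referee's reply for one turn.

    The FEN string is deterministic ground truth: injecting it into the
    next prompt re-anchors the model regardless of what it hallucinated
    while 'thinking'.
    """
    if proposed_move in legal_moves:
        return f"valid. position: {fen_after}"
    return f"invalid. position unchanged: {fen_before}"

# illustrative turn: a hand-picked subset of legal opening moves,
# and the standard FEN after 1.e4
msg = referee_feedback(
    "e4", {"e4", "d4", "Nf3"}, START_FEN,
    "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1",
)
```

whether the reply contains the FEN (and how often) is exactly the knob being debated here.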

have you tested them without the FEN feedback? i wonder how that would change the results

(Ironically, the script sounds AI-assisted to me.)

Yeah. One thing that’s not great for chess (and I’m not sure if there’s a better format) is that the encoded shorthand we use probably does not map well to tokens.

Also, not all thinking is worse: gpt54nano seems better with thinking, so it might come down to the model plus how they trained reasoning and attention (e.g. training it on the outputs of other LLMs might produce more ‘fake reasoning’ than ‘real reasoning’, which works okay for formulaic websites but not for chess).

The other thing is that I’m trying to make the chess games actually work and resolve, too. Which means, through the harness design, I’m choosing stuff that makes it easier for the AI to produce valid output, which also means valid moves. Human games don’t really work like that – they’re more like 2 players with no ref and no error checking. The models in that kind of setup aren’t so likely to call each other out for invalid moves or cheating (and sometimes one claimed a move was invalid when it wasn’t and the other went along with it).

So I’m not sure what the actual best design would be. Maybe I should give them all a list of valid moves to start with, regardless of the latest board state.
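The valid-move-list idea could be as simple as this (a hypothetical sketch, not the harness’s actual prompt format – function name and wording are made up):

```python
def build_move_prompt(fen, legal_moves, include_fen=True):
    """Assemble the per-turn prompt: optionally the FEN, plus an explicit
    menu of legal moves so the model only has to pick one rather than
    reconstruct the board state from its own context."""
    lines = []
    if include_fen:
        lines.append(f"Position (FEN): {fen}")
    lines.append("Legal moves: " + ", ".join(sorted(legal_moves)))
    lines.append("Reply with exactly one move from the list above.")
    return "\n".join(lines)

# illustrative call with a toy position and move list
prompt = build_move_prompt("8/8/8/8/8/8/8/K6k w - - 0 1",
                           ["Ka2", "Kb1", "Kb2"])
```

The tradeoff is that the menu constrains output validity without teaching the model anything about the position, which is sort of the point of the experiment.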

Also, I added Mercury 2, which is a diffusion LLM, and, some hiccups aside, it seems to work well enough. It has reasoning too. It might actually have more potential to do well, because the reasoning plus the explicit move justification in the output could interact in ways a normal autoregressive LLM’s can’t.

Anyway if anyone has suggestions on the chess thing LMK.

I originally wasn’t providing the FEN – I had an AI referee with access to a tool. (That ended up being more error than correction, though – at the protocol level, but also just general inconsistency.)

Now the ref is all code, so I can make it omit the FEN feedback and just say valid or invalid, but I’ll need to add a rule about how many consecutive errors are allowed.

Maybe I’ll run a few tests across the auto-ref configs: FEN-on-err, FEN-always, FEN-never, and a valid-move list combined with each of those three variants.
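That test grid could be enumerated like this (a sketch with invented names – `FenMode`, the forfeit threshold, and the dict layout are all assumptions, not the harness’s real config):

```python
from enum import Enum
from itertools import product

class FenMode(Enum):
    ALWAYS = "fen-always"      # inject FEN after every move
    ON_ERROR = "fen-on-err"    # inject FEN only after an invalid move
    NEVER = "fen-never"        # only say valid/invalid

def autoref_grid(max_consecutive_errors=3):
    """Cross the three FEN variants with legal-move-list on/off:
    six auto-ref configs, each carrying a threshold for how many
    consecutive invalid moves are tolerated before the game resolves."""
    return [
        {"fen_mode": mode,
         "legal_move_list": lml,
         "max_consecutive_errors": max_consecutive_errors}
        for mode, lml in product(FenMode, (True, False))
    ]
```

Running each model pair across all six cells would separate how much of the play quality is the model versus the resync scaffolding.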

Maybe ‘clanker’ was triggering it? Otherwise, this seems an odd thing for content filtering to block.

interesting. my guess is FEN-always is doing more than just convenience there – it’s probably acting as a hard state resync on every move. weaker models probably get carried by that a lot more, while stronger reasoning models depend on it less.

i think OAI’s focus on science acceleration isn’t just PR posturing. they talk about automating scientific discovery, and they’re training frontier reasoning models – rosalind for life sciences, just like codex for coding. i think the name is a nod to Rosalind Franklin, which is quite cool

Anyone else had issues with Anthropic and their dodgy practices? Here’s an example:

  • removes opus 200k from the list of models in claude code; only the 1m model is left. with the 200k model (which was smarter, too), you could leave a job running and it would never go over 200k tokens.
  • no way to set a compaction limit, and with the 1m model auto-compaction kicks in around 950k tokens or so.
  • 5min normal cache time (there’s an option to pay 2x for a 1hr cache time)

then:

  • API rate limiting or a service disruption happens → caches go cold
  • When everyone’s claude code reconnects, it massively eats into quota.
  • Anyone using extra usage (API billing) gets hit pretty hard.
  • Even on just the normal quota, this happened to me just now: I went from 5% of my 5h usage to 22%.
  • (Though the outage was less than 5 minutes, so I think I got routed to a different server/DC or something, which triggered a cache miss.)

They’re doing other dodgy stuff too; there’s a quote from an Anthropic employee in Louis’s video saying they can’t/won’t fix billing issues caused by their own mistakes:

and of course there’s a laundry list of complaints people have about them specifically, like making opus dumber over time without telling people (and possibly gaslighting users about it).