i asked it to write a tetris game. sonnet 4.6 in the claude app was WAY better at the same task and also gpt kept being like “ok here’s the real fixed definitely working version” (while screwing up again) which is dumber than how claude talks. i also tried local qwen3 coder with ollama which was super bad at writing tetris.
i’ve used gemini pro with the vscode extension and with the web app and found it a lot worse than opus. i have heard it can be good on some specific tasks where you want a really big context window.
note: i usually use the LLMs to give me coding advice, find bugs, and write little things, not to generate large amounts of code.
Ahh free tier stuff is problematic for evaluations IMO, especially without thinking enabled. It’s hard to know what’s going on with chatgpt behind the scenes due to model routing and automatic cost-reduction measures they take. and chat providers have different prompts and tools available in chat interfaces that can significantly alter things (gemini 3 has a big section in its prompt to make web stuff look way better, similar to claude’s frontend-design skill)
Also, I note you have GPT-5.3 selected not 5.3-codex or 5.4. Theo (t3) said recently he thinks openai are moving away from the -codex fine-tuned models going forward which hopefully will make this distinction clearer.
I’ll try 5.3-codex and 5.4 later today and see what they come up with.
As an example of free tier being a bad way to evaluate models (unless you’re interested in it specifically), here are some things I’ve re-tested that seem to fail on recent chat models but worked fine for me:
These examples are from https://www.youtube.com/@FatherPhi but finding the exact video is proving to be painful since YT don’t have an indicator for which shorts you’ve watched.
Pick a number less than 1000 with the letter ‘a’ in it (excluding ‘and’): basically all the models failed. (And none tried negative but Phi would have limited it to >= 0 in that case anyway.)
I tried this with glm-5 (a recent chinese model) and it thought through the problem logically and came to the right answer immediately. Besides some duplicate logic in its thinking output, it was the same as what I’d do (although it also didn’t consider negatives).
Counting from 1 to 100 or 200: Phi has a few videos on that, but I have had even cheap models successfully take 200+ items and assemble them into numbered lists without missing any numbers. (Maybe the extra content helps.) In the most recent case it was organizing tasks for a backlog, and while the tasks were numbered sequentially, there were also phases, milestones, and epics in a flat tree order (so the whole ordering was lexicographic).
This isn’t exactly free tier vs not free, but I think it’s indicative of the quality difference one sees. And Phi uses voice mode which I guess tries to almost always avoid thinking first.
If we are strictly talking about positive whole numbers, there aren’t any. From one (1) to nine hundred ninety-nine (999), the letter “a” doesn’t appear a single time in standard English spelling. The very first positive integer to use it is one thousand.
However, since your rules just asked for “a number less than 1000,” here are a couple of technically correct options that fit the bill:
-1: Spelled “negative one.”
1/2: Spelled “one half.”
Would you like me to track down a number containing a different specific letter?
opus 4.6
There is no such number. No number less than 1000 contains the letter ‘a’ in its English spelling (excluding “and”). The first appearance of ‘a’ is in “one thousand” (1,000).
gpt 5.3 codex
There isn’t one in standard English spelling: no cardinal number from 0 to 999 contains the letter “a” unless you include the word “and.”
gpt 5.4
There isn’t one.
In standard English, no cardinal number below 1000 contains the letter a unless you count the a in and as in one hundred and one.
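For what it’s worth, the underlying claim is easy to brute-force check. Here’s a quick sketch that spells out 0–999 as standard cardinal English (no “and”, which matches the puzzle’s exclusion) and scans for the letter ‘a’ — the `spell` helper is my own, not from any library:

```python
# Verify: no English cardinal spelling of 0-999 contains the letter 'a'
# (standard spellings, no "and" -- matching the puzzle's rules).

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def spell(n: int) -> str:
    """Spell 0-999 as English cardinal words, without 'and'."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[rest] if rest else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + spell(rest) if rest else "")

hits = [n for n in range(1000) if "a" in spell(n)]
print(hits)  # → [] -- no positive whole number below 1000 qualifies
```

So the models that answered “there is no such number” are right for non-negative integers, and the first hit really is “one thousand” (or a negative/fractional answer, as the gemini-style reply noted).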
I ended up testing more models. They were all pretty average IMO but mostly worked. The sonnet one has stupidly loud music (I set it to 10% but it’s doing linear volume, so still loud). The prompt is on the page.