As an example of why the free tier is a bad way to evaluate models (unless the free tier is specifically what you’re interested in), here are some things I’ve re-tested that seem to fail on recent chat models but worked fine for me:
These examples are from https://www.youtube.com/@FatherPhi but finding the exact video is proving to be painful since YouTube doesn’t have an indicator for which shorts you’ve watched.
Pick a number less than 1000 with the letter ‘a’ in it (excluding ‘and’): basically all the models failed. (And none tried negatives, though Phi would have limited it to >= 0 in that case anyway.)
I tried this with glm-5 (a recent Chinese model) and it thought through the problem logically and came to the right answer immediately. Besides some duplicate logic in its thinking output, it was the same as what I’d do (though it also didn’t consider negatives).
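For anyone who wants to check the puzzle by brute force rather than reasoning: the trick is that no English number word below one thousand contains an ‘a’ once ‘and’ is excluded. A quick sketch with a hand-rolled spelling function (the function names are mine, not from any of the videos):

```python
# Spell out 1-999 in English, omitting "and" as the puzzle requires,
# then search for any spelling that contains the letter 'a'.
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def spell(n: int) -> str:
    parts = []
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        parts.append(TENS[n // 10])
        n %= 10
    if 0 < n < 20:
        parts.append(ONES[n])
    return " ".join(parts)

hits = [n for n in range(1, 1000) if "a" in spell(n)]
print(hits)  # → [] — no such number exists, which is what makes it a trap
```

(Negative numbers would escape the trap, since “negative” contains an ‘a’ — hence the aside about none of the models trying them.)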
Counting from 1 to 100 or 200: Phi has a few videos on that, but I’ve had even cheap models successfully take 200+ items and assemble them into numbered lists without missing any numbers. (Maybe the extra content helps.) In the most recent case it was organizing tasks for a backlog: the tasks were numbered sequentially, but there were also phases, milestones, and epics in a flat tree order (so the whole ordering was lexicographic).
This isn’t exactly free tier vs. paid, but I think it’s indicative of the quality difference one sees. And Phi uses voice mode, which I guess tries to almost always avoid thinking first.