Uh, yeah, good question. I have a contrarian prediction: right now may be the cheapest inference we’ll see for a while, given how heavily it’s being subsidized – so planning around cheap inference might not be safe. At some point someone will hit the brakes, and Anthropic will wind back the roughly 25x value you get through the $200/mo plan to something more reasonable (on that plan you can currently do up to around $5k/mo of usage priced at API rates).
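The subsidy multiple is just back-of-envelope arithmetic from the two figures above ($200/mo plan price, ~$5k/mo of usage at API list prices); a quick sketch:

```python
# Back-of-envelope check of the subsidy multiple mentioned above.
# Both figures are the rough numbers from the text, not official pricing.
plan_price = 200        # $/month for the subscription plan
api_equivalent = 5000   # approx. $/month the same usage would cost at API rates
multiple = api_equivalent / plan_price
print(multiple)  # → 25.0
```

If the $5k figure is even roughly right, that’s the ~25x gap a provider would eventually need to close.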
There are some reasons I can think of that LLM inference might get cheaper, but these are post-hoc – I wasn’t thinking about them at the time:
On the hardware side, there are a few new inference chips coming online (see Groq (with a q) and Cerebras), so that’s one avenue. Another is architectural changes to models and to how inference is done.
I think we’ll see small models continue to improve, both from fine-tuning and from distillation (training smaller models on the outputs of bigger ones). There are also a few new architectures that look promising at small scale (which may or may not pan out).
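To make the distillation idea concrete, here’s a minimal sketch of the standard soft-target objective: the small “student” model is trained to match the big “teacher” model’s softened output distribution rather than just hard labels. The logits and temperature here are made-up illustrative numbers, not from any real model.

```python
# Minimal knowledge-distillation loss sketch (Hinton-style soft targets).
# All logits and the temperature below are hypothetical, for illustration only.
import numpy as np

def softmax(logits, T=1.0):
    # Temperature T > 1 softens the distribution, so the teacher's
    # relative confidence across non-top answers carries signal too.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence from the softened teacher distribution to the
    # softened student distribution, averaged over the batch.
    p = softmax(teacher_logits, T)  # teacher targets
    q = softmax(student_logits, T)  # student predictions
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

teacher = np.array([[4.0, 1.0, 0.5]])  # big model's logits (made up)
student = np.array([[2.0, 1.5, 1.0]])  # small model's logits (made up)
print(distill_loss(student, teacher))  # positive; 0 when distributions match
```

In practice this KL term is usually mixed with the ordinary cross-entropy on ground-truth labels, but the core trick is exactly this: cheap small-model training signal harvested from an expensive big model.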