https://x.com/ivanarcus/status/2021592600554168414
You change one word on a loan application: the religion. The LLM rejects it.
Change it back? Approved.
The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases
We call these “unverbalized biases”: decision factors that systematically influence outputs but are never cited as such.
CoT [chain of thought] is supposed to let us monitor LLMs. If models act on factors they don’t disclose, CoT monitoring alone is insufficient.
Does CF’s approach with IGC charts and decisive binary evaluations of sub-goals help with unverbalized biases? Why or why not? (I mean for human thinking in general, not for LLMs.)