Evaluating chain-of-thought monitorability

(openai.com)

38 points | by mfiguiere 2 days ago

3 comments

  • ursAxZA 3 hours ago
    I might be missing something here as a non-expert, but isn’t chain-of-thought essentially asking the model to narrate what it’s “thinking,” and then monitoring that narration?

    That feels closer to injecting a self-report step than observing internal reasoning.

    • crthpl 1 hour ago
      the chain of thought is what it is thinking
      • skissane 13 minutes ago
        When we think, our thoughts are composed of both nonverbal cognitive processes (we have access to their outputs, but generally lack introspective awareness of their inner workings), and verbalised thoughts (whether the “voice in your head” or actually spoken as “thinking out loud”).

        Of course, there are no doubt significant differences between whatever LLMs are doing and whatever humans are doing when they “think” - but maybe they aren’t quite as dissimilar as many argue? In both cases, there is a mutual/circular relationship between a verbalised process and a nonverbal one (in the LLM case, the inner representations of the model).

      • ursAxZA 1 hour ago
        Chain-of-thought is a technical term in LLMs — not literally “what it’s thinking.”

        As far as I understand it, it’s a generated narration conditioned by the prompt, not direct access to internal reasoning.

      • arthurcolle 31 minutes ago
        Wrong to the point of being misleading. This is a goal, not an assumption

        Source: all of mechinterp

      • Bjartr 55 minutes ago
        It is text that describes a plausible/likely thought process, and it conditions future generation by its presence in the context.
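        A minimal sketch of that conditioning, with a stub standing in for a real model (the `generate` function and its canned outputs are hypothetical, purely to illustrate the two-pass mechanics):

        ```python
        # Toy illustration: chain-of-thought text is just more context.
        # `generate` is a hypothetical stub for an LLM call; a real model
        # would sample tokens conditioned on the full prompt instead.

        def generate(context: str) -> str:
            """Stub 'model': returns canned text keyed on the context."""
            if "step by step" in context and "Step 1" not in context:
                return "Step 1: 17 + 5 = 22. Step 2: 22 * 2 = 44."
            return "44"

        prompt = "Q: What is (17 + 5) * 2? Let's think step by step.\n"

        # Pass 1: the model emits the "thought" narration...
        thought = generate(prompt)

        # Pass 2: ...which now sits in the context and conditions the answer.
        answer = generate(prompt + thought + "\nA: ")
        print(answer)
        ```

        The narration is never read out of the model's internals; it only influences later tokens by being part of the input, which is the point under debate above.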
  • ramoz 3 hours ago
    > Our expectation is that combining multiple approaches—a defense-in-depth strategy—can help cover gaps that any single method leaves exposed.

    Implement hooks in codex then.

  • leetrout 2 hours ago
    Related: check out chain of draft if you haven't.

    Similar performance with ~7% of the tokens of chain of thought.

    https://arxiv.org/abs/2502.18600