
  • gwern 16 hours ago
    There's a vein of research which interprets self-attention as a kind of gradient descent: LLMs have essentially pre-solved indefinitely large 'families' or 'classes' of tasks, and the 'learning' they do at runtime is simply gradient descent (possibly Newton's method) using the 'observations' to figure out which pre-solved instance they are now encountering. This explains why they fail in such strange ways, especially in agentic scenarios: if the true task is not inside those pre-learned classes, no amount of additional descent can find it once you've converged to the 'closest' pre-learned task. (Some links: https://gwern.net/doc/ai/nn/transformer/attention/meta-desce... )
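
    To make the construction concrete, here is a minimal numpy sketch in the spirit of the linear-attention-as-gradient-descent constructions in that literature: one pass of unnormalized linear self-attention over the context tokens reproduces exactly one gradient-descent step on in-context least squares, starting from zero weights. The dimensions, learning rate, and variable names are illustrative choices, not taken from any specific paper.

      # Toy check that a linear-attention pass equals one GD step on in-context regression.
      import numpy as np

      rng = np.random.default_rng(0)
      d, n, lr = 4, 32, 0.1

      # In-context regression task: y_i = W_true @ x_i
      W_true = rng.normal(size=(1, d))
      X = rng.normal(size=(n, d))            # context inputs  x_1..x_n
      Y = X @ W_true.T                       # context targets y_1..y_n
      x_q = rng.normal(size=(d,))            # query input

      # (a) One explicit GD step on L(W) = 1/2 * sum_i ||W x_i - y_i||^2, starting from W = 0
      grad = -(Y.T @ X)                      # dL/dW evaluated at W = 0
      W_1 = -lr * grad                       # weights after one step
      pred_gd = W_1 @ x_q

      # (b) One pass of unnormalized linear self-attention over the context,
      #     with keys/queries = x_i and values = lr * y_i
      attn_scores = X @ x_q                  # <x_i, x_q> for each context token
      pred_attn = lr * (Y.T @ attn_scores)   # sum_i <x_i, x_q> * lr * y_i

      print(np.allclose(pred_gd, pred_attn)) # True: the attention pass equals the GD step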

    I wonder if this can be interpreted as consistent with that 'meta-learned descent' PoV? If the system is fixed and is just cycling through fixed strategies, that is exactly what you'd expect under that view: the descent will thrash around the nearest pre-learned tasks, but it won't change the overall system or create new solved tasks.

  • Mathnerd314 19 hours ago
    So the takeaway I get from this paper is that if you take a language model and set it up to take an input and generate an output directed toward some goal (e.g., "make this sentence sound smarter"), then the process should converge, because it is following a potential function.
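
    A minimal sketch of what that reading amounts to, with score() standing in for the model's implicit potential and propose_edit() standing in for one sampled rewrite (both hypothetical placeholders, not anything from the paper): the accepted scores form a monotone, bounded sequence, which is the sense in which the process converges.

      import random

      def refine(text, score, propose_edit, max_steps=200):
          """Accept a proposed rewrite only if the potential strictly decreases."""
          best = score(text)
          for _ in range(max_steps):
              candidate = propose_edit(text)
              s = score(candidate)
              if s < best:                 # potential goes down on every accepted step
                  text, best = candidate, s
          # Accepted scores are monotone decreasing and bounded below, hence converge.
          return text, best

      # Trivial demo: the "goal" is to make the string exactly 20 characters long.
      score = lambda t: abs(len(t) - 20)
      propose = lambda t: t + "x" if random.random() < 0.5 else t[:-1]
      print(refine("hello", score, propose))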

    But I have used prompts like this a fair amount, and in practice it behaves more like stochastic gradient descent - most of the time, once the output is close to the target, the model makes a small incremental change, but when it is really close the model will sort of say "this is not improvable as it is" and take a large leap to a completely different configuration, then resume the incremental optimization from there, and so on. This could be an artifact of the sampling algorithm, but I think the deeper issue is that the model has this potential function encoded while the prompt and the structure of the model do not actually minimize it.

    So a real lesson here is that there is a lot of work still left to do on smarter sampling; beam search as used today is just the tip of the iceberg. If we could do optimization with the transformer model as a component - optimizing pipelines of reasoning rather than always generating inputs and outputs sequentially - that is where you could start using the potential function directly, and then you would see orders-of-magnitude smarter AI. There is work on prompt optimization, but it still treats models as black boxes rather than the piles of math they are.
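
    For what it's worth, the dynamics I am describing are easy to write down as a sampler: mostly local edits, with an occasional large jump when improvement stalls. This is a rough basin-hopping-style sketch, assuming hypothetical score(), local_edit(), and big_rewrite() callables; it is not the paper's method or any existing API.

      def refine_with_jumps(text, score, local_edit, big_rewrite,
                            max_steps=500, patience=20):
          """Small incremental edits, plus a large leap whenever progress stalls."""
          cur = score(text)
          best_text, best = text, cur
          stalled = 0
          for _ in range(max_steps):
              if stalled >= patience:
                  # "this is not improvable as it is": leap to a new configuration
                  text = big_rewrite(text)
                  cur, stalled = score(text), 0
                  continue
              proposal = local_edit(text)
              s = score(proposal)
              if s < cur:                  # the usual small incremental improvement
                  text, cur, stalled = proposal, s, 0
                  if cur < best:           # remember the best configuration seen
                      best_text, best = text, cur
              else:
                  stalled += 1
          return best_text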