Amazing that they are trying to solve this with hardware rather than with a new software architecture, but I suppose the current technology underlying LLM software must be far and away the best theoretically, or the most established, or the time it would take to seek out a new model isn't worth it for the big companies.
I know Yann LeCun is trying to do a completely different architecture and I think that's expected to take 2-3 years before showing commercial results, right? Is that why they're finding it quicker to change the hardware?
Nvidia has so much money that it would be a waste if they didn't attack current problems on multiple fronts at once.
People, researchers, investors, etc. probably also want to see what would be possible, and someone has to do it.
I can also imagine that an inference-optimized system like this could split the context across different requests when a request doesn't need the full context.
Could also be that they have internal use cases which require this amount of context.
What does this mean: "In addition, because most AI models are not trained uniformly across their maximum context length, their reasoning quality tends to degrade gradually near the limit rather than fail abruptly."
Models aren't trained across their context; their context is their short-term memory at runtime, right? Nothing to do with training. They are trained on a static dataset.
The attention residuals paper uses attention across layers for the same token, in addition to the usual case of attention across tokens within the same layer, but it doesn't do anything to address the "lost in too much context" problem. At least the number of layers is currently still low enough that there's probably no equivalent "lost in too many layers" problem yet.
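A toy numpy sketch of the idea as described above, where a token's current-layer state attends over its own hidden states from all earlier layers. To be clear, this follows the comment's description of the mechanism, not necessarily the repo's exact formulation; all names here are made up for illustration.

```python
import numpy as np

def cross_layer_attention(layer_states, d=16):
    """Attention over ONE token's per-layer states (hypothetical sketch).

    layer_states: (n_layers, d) array, one hidden vector per layer.
    """
    q = layer_states[-1]                      # current layer acts as the query
    scores = layer_states @ q / np.sqrt(d)    # scaled dot-product scores
    w = np.exp(scores - scores.max())         # numerically stable softmax
    w /= w.sum()
    return w @ layer_states                   # mix of earlier-layer states

rng = np.random.default_rng(1)
states = rng.normal(size=(12, 16))            # e.g. a 12-layer-deep stack
out = cross_layer_attention(states)
print(out.shape)                              # same shape as one layer state
```

As the comment notes, with only a few dozen layers the "sequence" this attention runs over stays tiny, which is presumably why there's no "lost in too many layers" problem yet.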
I think it means most of the training data is short. And a lot of the long-context examples are conversations where the middle turns are less important.
Current approaches require fancy tricks to fit tokens into memory, and spread attention thinner over larger numbers of tokens. The new approach tries to find a way to keep everything in a single shared memory, and to process the tokens in parallel using multiple GPUs.
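The "attention spread thinner" part is easy to see in a toy numpy experiment (my own illustration, not tied to any specific model): softmax attention has a fixed budget of 1.0, so one relevant "signal" key gets diluted as more random distractor keys join the context.

```python
import numpy as np

rng = np.random.default_rng(0)

def signal_weight(n_ctx, d=64):
    """Softmax attention weight on one query-aligned key among distractors."""
    q = rng.normal(size=d)
    signal = q / np.linalg.norm(q)            # key pointing along the query
    distract = rng.normal(size=(n_ctx - 1, d))
    keys = np.vstack([signal, distract])
    scores = keys @ q / np.sqrt(d)            # scaled dot-product scores
    w = np.exp(scores - scores.max())         # stable softmax
    w /= w.sum()
    return float(w[0])                        # weight left for the signal key

for n in (128, 2048, 32768):
    print(n, signal_weight(n))                # shrinks as context grows
```

Longer context means more competitors in the softmax denominator, so whatever the model should focus on gets a smaller share of attention unless training teaches it to score distractors much lower.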
Is such a large context window even desirable? It seems like it would consume an awful lot of tokens and, unless one was very careful to curate the context, could even result in worse performance.
That's either the R&D part of this chip, or Nvidia has the use case themselves.
Nvidia uses ML for fine-tuning and designing the architecture of their chips; this might be one use case.
Another one would be to put EVERYTHING from your company into this context window. It would be easier to create 'THE' model for every company or person. It might also be safer than training a model on your data, because you don't end up with a model containing all your data, only memory.
I remember when a large context was 8k! Nowadays that would seem extremely small, because we have new use-cases that require much larger context sizes. Maybe in the future, we will invent ways to use inference on very large contexts that we cannot even imagine today.
I noticed that the longer a chat gets, the more unpredictable the model's behavior becomes (and I think that's still a common jailbreak technique too).
(I think it might also have something to do with RoPE, but that's beyond me.)
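For anyone curious, here's a minimal RoPE sketch (my own toy version, not any particular model's implementation). The key property is that after rotating queries and keys by position-dependent angles, the q·k score depends only on the *relative* distance between tokens; positions far beyond what training covered produce angle combinations the model simply never practiced on.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate pairs of dimensions of x by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # one frequency per 2-D pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], -1)

q, k = np.ones(8), np.ones(8)
s1 = rope(q, 100) @ rope(k, 90)       # query at 100, key at 90
s2 = rope(q, 1010) @ rope(k, 1000)    # query at 1010, key at 1000
print(np.allclose(s1, s2))            # both pairs are 10 positions apart
```

This relative-position property is what makes RoPE extendable in principle, but a model trained mostly on short sequences still hasn't learned what the scores at huge offsets mean.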
Or, to say it differently: the LLM gets trained on static data, but also on the capability of handling context itself.
Kimi introduced this https://github.com/MoonshotAI/Attention-Residuals but I'm pretty sure closed labs like Google have had something like this for a while.
But maybe that’s enough tokens to feed an entire lifetime of user behaviour in for the digital twin dystopia?