Regarding architecture, I don't believe a satisfying "why" is in the cards.
Conceptually neural networks are quite simple. You can think of each neural net as a daisy chain of functions that can be efficiently tuned to fulfill some objective via backpropagation.
Their effectiveness (in the dimensions we care about) is more a consequence of the explosion of compute and data that occurred in the 2010s.
In my view, every hyped architecture was the one that yielded the best accuracy given the compute resources available at the time. It's not a given that these architectures are optimal, and we certainly don't always fully understand why they work. Most of the innovations in this space over the past 15 years have come from private companies that have lacked a strong research focus but are resource rich (endless compute and data capacity).
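To make the "daisy chain of functions tuned by backpropagation" point concrete, here's a minimal sketch in plain NumPy: a hypothetical two-layer network on made-up data, with the forward chain and the chain-rule backward pass written out by hand. Everything here (shapes, data, learning rate) is illustrative, not any particular published architecture.

```python
# Minimal sketch of the "daisy chain of functions" view: two chained
# affine maps with a non-linearity, tuned by backpropagation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))                        # toy inputs
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]   # toy XOR-like target

W1, b1 = rng.normal(size=(2, 8)) * 0.5, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.5, np.zeros(1)
lr = 0.5

for step in range(2000):
    # forward pass: f2(f1(x)) -- the "daisy chain"
    h = np.tanh(X @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))             # sigmoid output
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    # backward pass: the chain rule pushes the error back through each link
    dlogits = (p - y) / len(X)
    dW2, db2 = h.T @ dlogits, dlogits.sum(0)
    dh = dlogits @ W2.T * (1 - h ** 2)
    dW1, db1 = X.T @ dh, dh.sum(0)

    # gradient step tunes every link toward the objective
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final loss: {loss:.3f}")
```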
http://www.ai-junkie.com/ann/evolved/nnt1.html

This is old, perhaps late 90s or early 2000s. The top domain still uses Flash. But the same OCR example is used to teach the concept. For some reason, that site made it all click for me.
Lovely visualization. I like the very concrete depiction of the middle layers "recognizing features", which makes the whole machine feel more plausible. I'm also a fan of visualizing things, but I think it's important to appreciate that some things (like a 10,000-dimensional vector as the input, or even a 100-dimensional vector as the output) can't be concretely visualized, and you have to develop intuitions in more roundabout ways.
I hope they make more of these; I'd love to see a transformer presented this clearly.
I have a question. With the logic of neural networks and pattern recognition, is it not then possible to "predict" everything in everything? Like predicting the future down to an exact "thing"? Is this not a tool to manipulate, for instance, the stock market?
It is possible to try it, and some people do (high-speed trading is just that, plus taking advantage of the privileged information that speed provides to react before anyone else).
However, there are two fundamental problems with computational predictions. The first one, obviously, is accuracy. A model is a compressed memorization of everything observed so far; a prediction with it is just projecting the observed patterns into the future. In a chaotic system, that only goes so far: the most regular, predictable patterns are obvious to everybody and give less return, and the chaotic system states where prediction would be most valuable are the least reliable. You cannot build a perfect oracle that would fix that.
The second problem is more insidious. Even if you were able to build a perfect oracle, acting on its predictions would become part of the system itself. That would change the outcomes, making the system behave differently from the one it was trained on, and thus less reliably. If several people do it at the same time, there's no way to retrain the model to take the new behaviour into account.
There's the possibility (but not a guarantee) of reaching a fixed point, where a Nash equilibrium appears and the system settles into a stable cycle, but that's not likely in a changing environment where everybody tries to outdo everyone else.
Ah, this actually connects a few dots for me. It helps explain why models seem to have a natural lifetime: once deployed at scale, they start interacting with and shaping the environment they were trained on. Over time, data distributions, usage patterns, and incentives shift enough that the model no longer functions as the one originally created, even if the weights themselves haven’t changed.
That also makes sense of the common perception that a model feels “decayed” right before a new release. It’s probably not that the model is getting worse, but that expectations and use cases have moved on, people push it into new regimes, and feedback loops expose mismatches between current tasks and what it was originally tuned for.
In that light, releasing a new model isn’t just about incremental improvements in architecture or scale; it’s also a reset against drift, reflexivity, and a changing world. Prediction and performance don’t disappear, but they’re transient, bounded by how long the underlying assumptions remain valid.
So when AI companies "retire" a model, it's not only because of their new, better model, but also because of this kind of decay?
PS: I cleaned up what I wrote above with AI (I'm not a native English speaker).
Well, nothing is stopping you from attempting to predict everything with neural networks, but that doesn't mean your predictions will be (1) good, (2) consistently useful, or (3) economical. Transformer models, for example, suffer from (2) and especially (3) in their current iteration.
DNNs learn patterns; for them to work, there must be some. The stock market is almost entirely driven by random real-world events that aren't recurrent, so you can't predict much at all.
Great explanation, but the last question is quite simple. You determine the weights via brute force: simply run a large amount of data for which you have the input as well as the correct output (handwriting to text in this case).
"Brute force" would be trying random weights and keeping the best performing model. Backpropagation is compute-intensive but I wouldn't call it "brute force".
What? Either option requires sufficient data. Brute force implies iterating over all combinations until you find the best weights. Back-prop is an optimization technique.
No, a large dataset does not make something brute force. Rather than backprop, an example of brute force might be taking a single input-output pair and then systematically sampling the model parameter space to search for a sufficiently close match.
The sampling stage of Evolution Strategies at least bears a resemblance but even that is still a strategic gradient descent algorithm. Meanwhile backprop is about as far from brute force as you can get.
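For anyone following this sub-thread, here's a toy sketch of the distinction being argued, on a hypothetical one-parameter fit with made-up data: "brute force" as sampling the parameter space and keeping the best candidate, versus gradient descent, which follows the loss gradient instead of enumerating candidates.

```python
# Brute force (random parameter search) vs. gradient descent on a toy
# one-parameter least-squares problem. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # true weight is 3.0

def loss(w):
    return np.mean((w * x - y) ** 2)

# brute force: try random weights, keep the best one found
candidates = rng.uniform(-10, 10, size=1000)
w_brute = candidates[np.argmin([loss(w) for w in candidates])]

# gradient descent: follow d(loss)/dw = 2 * mean((w*x - y) * x)
w_gd = 0.0
for _ in range(200):
    grad = 2 * np.mean((w_gd * x - y) * x)
    w_gd -= 0.1 * grad

print(f"brute force: {w_brute:.3f}, gradient descent: {w_gd:.3f}")
```

Both need data to evaluate the loss; the difference is that one searches blindly while the other uses the gradient to move directly toward better weights.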
Oh wow, this looks like a 3D render of the perceptron diagrams from when I started reading about neural networks. I guess neural networks are essentially built on that idea? Inputs > weight function to adjust the final output toward desired values?
The layers themselves are basically perceptrons, not really any different from a generalized linear model.
The ‘secret sauce’ in a deep network is the hidden layer with a non-linear activation function. Without that you could simplify all the layers to a linear model.
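A quick sketch of that point, with arbitrary shapes and random weights: two stacked linear layers collapse exactly into a single linear layer, and it's the non-linearity in between that breaks the collapse.

```python
# Why the non-linearity matters: two stacked linear layers are exactly
# one linear layer, so depth adds nothing without an activation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

# two linear layers...
two_layers = (x @ W1 + b1) @ W2 + b2
# ...equal one linear layer with W = W1 @ W2 and b = b1 @ W2 + b2
collapsed = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(two_layers, collapsed))   # True

# insert a non-linearity (ReLU) and the collapse no longer holds
with_relu = np.maximum(x @ W1 + b1, 0) @ W2 + b2
print(np.allclose(with_relu, collapsed))    # False (in general)
```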
Nice visuals, but it misses the mark. Neural networks transform vector spaces and collect points into bins; this visualization shows only the structure of the computation. That is akin to displaying a matrix-vector multiplication in Wx + b notation, except W, x, and b have more exciting displays.
It completely misses the mark on what it means to 'weight' (linearly transform), bias (affine transform), and then non-linearly transform (i.e., 'collect') points into bins.
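As a rough illustration of that framing, with toy 2-D points, arbitrary weights, and a hard threshold standing in for the activation: the weight matrix linearly transforms the point cloud, the bias shifts it (making the map affine), and the non-linearity "collects" each point into a bin.

```python
# Transform, then collect into bins: affine map w.x + b followed by a
# hard threshold that assigns each point to a quadrant-like bin.
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 2))

W = np.array([[1.5, -0.5],
              [0.3,  2.0]])           # "weight": linear transform of the plane
b = np.array([0.2, -1.0])             # "bias": shift -> affine transform

transformed = points @ W + b          # where each point lands after the affine map
bins = (transformed > 0).astype(int)  # threshold "collects" points into bins
                                      # labelled 00, 01, 10, 11

for p, t, q in zip(points, transformed, bins):
    print(p.round(2), "->", t.round(2), "bin", q)
```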
It doesn't match the pictures in your head, but it nevertheless does present a mental representation the author (and presumably some readers) find useful.
Instead of nitpicking, perhaps pointing to a better visualization (like maybe this video: https://www.youtube.com/watch?v=ChfEO8l-fas) could help others learn. Otherwise it's just frustrating to read comments like this.
It's not nitpicking to point out major missing pieces. Comments like this might come across as critical, but they are incredibly valuable for any reader who doesn't know what they don't know.
Make a visualization of the article above and it would be the biggest aha moment in tech.
If you want to understand neural networks, keep going.
I'm fairly sure that Alpha Zero's data is generated by Alpha Zero itself. But it's not an LLM.
I don't think it's a moiré effect, but yeah, looking at the pattern I can see why it reads that way.
<https://visualrambling.space/dithering-part-1/>
<https://visualrambling.space/dithering-part-2/>
That's cool, that's how shades were rendered in the old days.
Man, those graphics are so damn good.
https://en.wikipedia.org/wiki/Multilayer_perceptron
https://mlu-explain.github.io/neural-networks/