8 comments

  • RyanShook 1 hour ago
    Here's where I'm missing understanding: for decades the idea of neural networks had existed with minimal attention. Then in 2017 Attention Is All You Need gets released and since then there is an exponential explosion in deep learning. I understand that deep learning is accelerated by GPUs but the concept of a transformer could have been used on much slower hardware much earlier.
    • pash 52 minutes ago
      The inflection point was 2012, when AlexNet [0], a deep convolutional neural net, achieved a step-change improvement in the ImageNet classification competition.

      After seeing AlexNet’s results, all of the major ML imaging labs switched to deep CNNs, and other approaches almost completely disappeared from SOTA imaging competitions. Over the next few years, deep neural networks took over in other ML domains as well.

      The conventional wisdom is that it was the combination of (1) exponentially more compute than in earlier eras with (2) exponentially larger, high-quality datasets (e.g., the curated and hand-labeled ImageNet set) that finally allowed deep neural networks to shine.

      The development of “attention” was particularly valuable in learning complex relationships among somewhat freely ordered sequential data like text, but I think most ML people now think of neural-network architectures as being, essentially, optimizations: designs that facilitate learning in one context or another when data and compute are in short supply, and not as being fundamental to learning. The “bitter lesson” [1] is that more compute and more data eventually beat better models that don’t scale.

      Consider this: humans have on the order of 10^11 neurons in their body, dogs have 10^9, and mice have 10^7. What jumps out at me about those numbers is that they’re all big. Even a mouse needs tens of millions of neurons to do what a mouse does. Intelligence, even of a limited sort, seems to emerge only after crossing a high threshold of compute capacity.

      0. https://en.wikipedia.org/wiki/AlexNet

      1. https://en.wikipedia.org/wiki/Bitter_lesson

    • porcoda 28 minutes ago
      As others pointed out, the explosion of interest started with the deep convolutional networks that were applied to image problems. What I always thought was interesting was that prior to that, NNs were largely dismissed as uninteresting. When I took a course on them around the year 2000 that was the attitude most people took. It seems like what it took to spark renewed interest was ImageNet and seeing what you get when you have a ton of training data to throw at the problem and fast processors to help. After that the ball kept rolling with the subsequent developments around specific network architectures. In the broader community AlexNet is viewed as the big inflection point, but in the academic community you saw interest simmering a couple years earlier - I began to see more talks at workshops about NNs that weren’t being dismissed anymore, probably starting around 2008/09.
    • cgearhart 51 minutes ago
      A much earlier major win for deep learning was AlexNet for image recognition in 2012. It dominated the competition and within a couple years it was effectively the only way to do image tasks. I think it was Jeremy Howard who wrote a paper around 2017 wondering when we’d get a transfer learning approach that worked as well for NLP as convnets did for images. The attention paper that year didn’t immediately dominate. The hardware wasn’t good enough and there wasn’t a consensus that scale would solve everything. It took like five more years before GPT-3 took off and started this current wave.

      I also think you might be discounting exactly how much compute is used to train these monsters. A single 1 GHz processor would take about 100,000,000 years to train something in this class. Even with on the order of 25k GPUs, training GPT-3 size models takes a couple months. Given the anemic RAM on GPUs a decade ago (I think we had K80 GPUs with 12GB vs roughly 100GB on H100/H200 today), it was actually completely impossible to train a large transformer model prior to the early 2020s.
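
      To make the order of magnitude concrete, here is a rough back-of-envelope sketch. It assumes the roughly 3.14e23 total training FLOPs reported for the 175B-parameter GPT-3 and a hypothetical 1 GHz core that sustains one floating-point operation per cycle. Both the per-cycle throughput and the helper name `training_years` are assumptions made for illustration, not measurements.

```python
# Back-of-envelope estimate only; the constants are assumptions, not measurements.
TRAIN_FLOPS = 3.14e23                   # total training compute reported for GPT-3 175B
CLOCK_HZ = 1e9                          # the hypothetical 1 GHz processor
FLOPS_PER_CYCLE = 1.0                   # assumed sustained throughput of that core
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def training_years(total_flops: float, flops_per_second: float) -> float:
    """Years needed to execute total_flops at a constant flops_per_second."""
    return total_flops / flops_per_second / SECONDS_PER_YEAR


single_core_years = training_years(TRAIN_FLOPS, CLOCK_HZ * FLOPS_PER_CYCLE)
print(f"{single_core_years:.2e} years on one core")
```

      Under these assumptions the answer lands around ten million years. A core that sustains only a tenth of a FLOP per cycle, which is plausible once memory stalls are counted, would push it toward the hundred million years quoted above. Either way the conclusion is the same.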

      I’m even reminded how much gamers complained in the late 2010s about GPU prices skyrocketing because of ML use.

    • embedding-shape 1 hour ago
      > I understand that deep learning is accelerated by GPUs but the concept of a transformer could have been used on much slower hardware much earlier

      But they don't give the same results at those smaller scales. People imagined it, but no one could have put it into practice because the hardware wasn't there yet. Simplified, LLMs are basically Transformers with the additional idea of "and a shitton of data to learn from", and to make training feasible with that amount of data, you do need some capable hardware.

    • whateverboat 59 minutes ago
      The same thing happened with matrices. We had matrices for 400 years, but the field of linear algebra, and especially numerical linear algebra, exploded only with the advent of computers.

      In the olden days, the correct way to solve a linear system of equations was to use the theory of minors. With the advent of computers, you suddenly had a huge theory of Gaussian elimination, Krylov spaces and whatnot.
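
      A minimal sketch of the contrast, reading "theory of minors" as Cramer's rule with determinants computed by cofactor expansion (that reading is an assumption). The function names and the 3x3 example system are illustrative only. Exact fractions are used so both methods can be compared without rounding.

```python
from fractions import Fraction


def det(m):
    """Determinant by cofactor (minor) expansion along the first row: O(n!) work."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        # Minor: drop row 0 and column j.
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def solve_cramer(a, b):
    """Solve a x = b via Cramer's rule: x_k = det(a with column k replaced by b) / det(a)."""
    d = det(a)
    if d == 0:
        raise ValueError("singular matrix")
    solution = []
    for k in range(len(a)):
        a_k = [row[:k] + [b[i]] + row[k + 1:] for i, row in enumerate(a)]
        solution.append(Fraction(det(a_k), d))
    return solution


def solve_gauss(a, b):
    """Solve a x = b by Gaussian elimination with back substitution: O(n^3) work."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


A = [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]]
b = [8, -11, -3]
print(solve_cramer(A, b), solve_gauss(A, b))
```

      Both routines give the same answer on a small system. The difference is cost: cofactor expansion grows factorially with the matrix size, while elimination grows cubically, which is why only the latter is usable at the sizes computers made routine.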

    • quicklywilliam 26 minutes ago
      Agreed, there is probably a theoretical world where we got enough money/compute together and had this explosion happen earlier.

      Or perhaps a world where it happened later. I think a big part of what enabled the AI boom was the concentration of money and compute around the crypto boom.

    • BigTTYGothGF 1 hour ago
      The modern neural net revival got kicked off long before 2017.
      • noosphr 55 minutes ago
        AlexNet in 2012 is only 5 years earlier.
    • teekert 1 hour ago
      If you are in the radiology field it started “exploding” much earlier, with CNNs.
    • CamperBob2 35 minutes ago
      > the concept of a transformer could have been used on much slower hardware much earlier.

      It could have been done in the early 1970s -- see "Paper tape is all you need" at https://github.com/dbrll/ATTN-11 and the various C-64 projects that have been posted on HN -- but the problem was that Marvin Minsky "proved" that there was no way a perceptron-based network could do anything interesting. Funding dried up in a hurry after that.

    • wslh 59 minutes ago
      Don't underestimate the massive amount of data you need to make those networks tick. They were also impracticable with slow training algorithms, regardless of whether they ran on GPUs or CPUs.
  • le-mark 7 minutes ago
    > We argue complexity conceals underlying regularity, and that deep learning will indeed admit a scientific theory

    That would be amazing, but personally I’m skeptical.

  • sweezyjeezy 42 minutes ago
    Deep learning works, at a very high level, because 'it can keep learning from more data' better than any other approach. But without the 'stupid amount of data' that is available now, the architecture would be kind of irrelevant. Unless you go some way toward explaining both sides of the model-data equation, I don't feel you have a solid basis on which to build a scientific theory of, e.g., 'why reasoning models can reason'. The model is the product of both the architecture and the training data.

    My fear is that this is as hopeless right now as explaining why humans or other animals can learn certain things from their huge amount of input data. We'll gain better understanding, but it won't ever be fundamental computer science again, because the giga-datasets are the fundamental complexity, not the architecture.

  • UltraSane 47 minutes ago
    I think we need the equivalent of general relativity for latent spaces.
  • 4b11b4 1 hour ago
    wow.. this would be cool. Instead of just.. guessing "shapes"
    • NitpickLawyer 1 hour ago
      tbf, we've learned (ha!) more from smashing teeny tiny particles and "looking" at what comes out than from say 40 years of string theory. Sometimes doing stuff works, and the theory (hopefully) follows.
  • adzm 2 hours ago
    I'm only partially through this paper, but it's written in a very engaging and thoughtful manner.

    There is so much to digest here but it's fascinating seeing it all put together!

  • amelius 1 hour ago
    "A New Kind of Science" ...