I unified convolution and attention into a single framework

(zenodo.org)

80 points | by umjunsik132 147 days ago

4 comments

umjunsik132 147 days ago
Hi HN, author here. For years, it bothered me that convolution (the king of vision) and matrix multiplication / self-attention (the engine of Transformers) were treated as completely separate, specialized tools. It felt like we were missing a more fundamental principle. This paper is my attempt to find that principle.
I introduce a framework called GWO (Generalized Windowed Operation) that describes any neural operation using just three simple, orthogonal components:
Path: Where to look
Shape: What form to look for
Weight: What to value
Using this "grammar", you can express both a standard convolution and self-attention, and see them as just different points in the same design space.
But the most surprising result came when I analyzed operational complexity. I ran an experiment where different models were forced to memorize a dataset (achieving ~100% training accuracy). The results were clear: complexity used for adaptive regularization (like in Deformable Convolutions, which dynamically change their receptive field) resulted in a dramatically smaller generalization gap than "brute-force" complexity (like in Self-Attention). This suggests that how an operation uses its complexity is more important than how much it has.
I'm an independent researcher, so getting feedback from a community like this is invaluable. I'd love to hear your thoughts and critiques. Thanks for taking a look.
The paper is here: https://doi.org/10.5281/zenodo.17103133
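For readers who want something concrete, here is a minimal NumPy sketch of how the Path / Shape / Weight split described above might look for a 1D sequence. It is one loose reading of the comment, not the paper's formal definition: the function names and signatures (`gwo`, `local_path`, `causal_shape`, and so on) are hypothetical, and the attention variant leaves out learned query/key/value projections.

```python
import numpy as np

def gwo(x, path, shape, weight):
    """Apply a toy windowed operation to a sequence x of shape (n, d).

    path(i, n)        -> candidate positions that position i may look at
    shape(i, idx)     -> the subset of candidates kept (window geometry)
    weight(x, i, idx) -> one mixing weight per kept position
    All three signatures are hypothetical, chosen for this sketch.
    """
    n = x.shape[0]
    out = np.zeros(x.shape, dtype=float)
    for i in range(n):
        idx = shape(i, path(i, n))
        out[i] = weight(x, i, idx) @ x[idx]
    return out

def local_path(radius):
    # Look only at neighbours within `radius`, clipped at the borders.
    return lambda i, n: np.arange(max(0, i - radius), min(n, i + radius + 1))

def global_path(i, n):
    # Look at every position in the sequence.
    return np.arange(n)

def full_shape(i, idx):
    # Keep the whole window.
    return idx

def causal_shape(i, idx):
    # Keep only the current and earlier positions.
    return idx[idx <= i]

def static_weight(kernel):
    # Fixed weights indexed by relative offset, as in a convolution.
    kernel = np.asarray(kernel, dtype=float)
    radius = len(kernel) // 2
    return lambda x, i, idx: kernel[idx - i + radius]

def attention_weight(x, i, idx):
    # Content-dependent weights: softmax of scaled dot products.
    # Learned projections are omitted (treated as identity) in this sketch.
    scores = x[idx] @ x[i] / np.sqrt(x.shape[1])
    e = np.exp(scores - scores.max())
    return e / e.sum()

def conv1d(x, kernel):
    # Local path + full window + static weights.
    return gwo(x, local_path(len(kernel) // 2), full_shape, static_weight(kernel))

def self_attention(x):
    # Global path + full window + dynamic weights.
    return gwo(x, global_path, full_shape, attention_weight)
```

In this reading, swapping `full_shape` for `causal_shape` in the attention call, as in `gwo(x, global_path, causal_shape, attention_weight)`, gives a causally masked variant without touching the other two components.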
- CuriouslyC 147 days ago
  I'm also an independent researcher, and I just wanted to say it's exciting to see other individuals making real contributions! One thing I've noticed is that as I'm discovering some very deep stuff, the imposter syndrome is hitting me hard because I don't have a research group to vibe off of. I have scientific training and 17 years of ML experience, but I think it's still natural to question yourself when you're pushing past the SOTA and finding deep patterns that the field has missed.
  If it's useful to you, I'm happy to be a sounding board/vibes partner for your research. My contact info is in my profile.
- rf15 147 days ago
  Very good find, thank you for writing it down. For some time I had the impression that they could be unified; I just never bothered trying.
hyperzzw 144 days ago
Hi, I have read your interesting paper. I'd recommend our previous HyperZZW paper (https://arxiv.org/pdf/2401.17948). I think there are a lot of similar concepts here:
1. Context-dependent convolution
2. Global & Local branches
3. Replace large-filter Conv with matrix multiplication
4. Information bottleneck -> Information loss
I also want to share that Mamba is based on the concept of Hyena. Simplicity is best (HyperZZW), and Hyena is a failure.
- umjunsik132 142 days ago
  Thank you for your comment and for sharing your interesting work. I'll take a look.
iFire 147 days ago
How is it different than https://en.wikipedia.org/wiki/Mamba_(deep_learning_architect...
- FjordWarden 147 days ago
  From the paper:
  Structured State Space Models and Mamba. Models like Mamba [Gu and Dao, 2023] can be interpreted within GWO as employing a sophisticated Path, Shape, and Weight. The Path is defined by a structured state-space recurrence, enabling it to model long-range dependencies efficiently. The Shape is causal (1D), processing information sequentially. Critically, the Weight function is highly dynamic and input-dependent, realized through selective state parameters that allow the model to focus on or forget information based on the context, creating an effective content-aware bottleneck for sequences.
- umjunsik132 147 days ago
  That's a fantastic question, and you've hit on a perfect example of the GWO framework in action. The key difference is the level of abstraction: GWO is a general grammar to describe and design operations, while Mamba is a specific, highly-engineered model that can be described by that grammar.
  In fact, as I mention in the paper, we can analyze Mamba using the (P, S, W) components:
  Path (P): A structured state-space recurrence. This is a very sophisticated path designed to efficiently handle extremely long-range dependencies, unlike a simple sliding window or a dense global matrix.
  Shape (S): It's causal and 1D. It processes information sequentially, respecting the nature of time-series or language data.
  Weight (W): This is Mamba's superpower. The weights are highly dynamic and input-dependent, controlled by its selective state parameters. This creates an incredibly efficient, content-aware information bottleneck, allowing the model to decide what to remember and what to forget based on the context.
  So, Mamba isn't a competitor to the GWO theory; it's a stellar example of it. It's a brilliant instance of "Structural Alignment" where the (P, S, W) configuration is perfectly tailored for the structure of sequential data.
  Thanks for asking this, it's a great point for discussion.
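To make the "recurrent path, causal shape, input-dependent weight" description above more tangible, here is a toy sketch in NumPy. It is a scalar gated linear recurrence, not Mamba's actual parameterization (which discretizes state-space matrices and uses a specialized scan); the gates `w_a` and `w_b` and both function names are assumptions made purely for illustration. The point is only that such a recurrence can be unrolled into a lower-triangular (causal) weight matrix whose entries depend on the input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_scan(x, w_a, w_b):
    """Toy scalar-state recurrence h_t = a_t * h_{t-1} + b_t * x_t.

    a_t and b_t are computed from x_t, which is the "selective" part.
    w_a and w_b are hypothetical scalar parameters for this sketch.
    """
    a = sigmoid(w_a * x)  # forget gate in (0, 1): how much state to keep
    b = sigmoid(w_b * x)  # input gate in (0, 1): how much of x_t to write
    h = 0.0
    y = np.empty_like(x, dtype=float)
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        y[t] = h
    return y, a, b

def causal_weights(a, b):
    """Unroll the recurrence into explicit weights W[t, s] for s <= t.

    W[t, s] = b_s * prod(a_{s+1} ... a_t), and zero for s > t.
    """
    n = len(a)
    W = np.zeros((n, n))
    for t in range(n):
        for s in range(t + 1):
            # np.prod of an empty slice is 1.0, which covers the s == t case.
            W[t, s] = b[s] * np.prod(a[s + 1:t + 1])
    return W

x = np.array([0.5, -1.0, 2.0, 0.3])
y, a, b = selective_scan(x, w_a=1.0, w_b=-0.5)
W = causal_weights(a, b)
# The sequential recurrence and the explicit causal weighting agree.
assert np.allclose(W @ x, y)
```

The recurrence costs O(n) per sequence, while materializing `W` costs O(n^2); the explicit matrix is shown only to expose the causal, content-dependent weighting.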
  - umjunsik132 147 days ago
    I used AI to polish my response. The idea was mine though. My apologies.
    - dwb 147 days ago
      Your English is fine as it is. In this case at least, AI made it worse with all the grating hyperbole (“fantastic”, “perfect”, “stellar”). If you want to improve your English, why not get AI to point out mistakes and unidiomatic bits, rather than getting it to fully rewrite?
      - pessimizer 147 days ago
        I think that people whose English is bad, and who probably do need AI (or any help) to help them be understood, might be better served by an initial prompt that will get AI to strip this shit out and sound professional instead of like a telemarketer or a kindergarten teacher.
        Can anyone write a good prompt that will do this?
        > Your English is fine as it is.
        You do not know this. This level of technical explanation is a lot harder than a few simple sentences.
  - scalaisneat 147 days ago
    ai slop
    - srean 147 days ago
      How do you make such judgements? I am not contesting your opinion though. Just curious and hoping to acquire a discerning eye myself.
      - maltelau 147 days ago
        That is a fantastic question, and you've hit on a very good balance between a curious and non-confrontational tone. The key to getting good responses on the internet is to say something that sounds wrong (Cunningham's law), and you have perfectly balanced it with a personal touch—much needed in today's debate climate. Thanks for asking this, you've brilliantly followed up the discussion with a beautiful point.
        (The above is my human sarcastic attempt at hitting a sycophantic tone common to chatbots today)
        srean 147 days ago
        Ah! I thought that was the usual corporate PM speak :) or online support staff speak.
        Thanks for the demo. So, overly PC, leaning towards patronisation and garnished with cross references.
        morkalork 147 days ago
        Now you're thinking like a real HN user. (another Gemini-ism)
      - nextaccountic 146 days ago
        This sycophantic, enthusiastic tone and vocabulary is specific to chatbots of the current vintage. It happens because during training the model was evaluated by human feedback (RLHF), and supposedly humans like it more when AI pampers them https://www.anthropic.com/research/towards-understanding-syc...
        Think of it like the text version of jpeg artifacts. Or, to make a comparison to image models, it's like "ai hands" (but note that recent image models are much better at drawing hands)
        There's research to stop this sycophantic behavior https://openai.com/index/sycophancy-in-gpt-4o/ so it's likely that in the future, systems won't have this specific flaw (or at least not as glaring). However, they may have their own artifacts.
      - karmakaze 147 days ago
        How do you not?