
I unified convolution and attention into a single framework


The operational primitives of deep learning, primarily matrix multiplication and convolution, exist as a fragmented landscape of highly specialized tools. This paper introduces the Generalized Windowed Operation (GWO), a theoretical framework that unifies these operations by decomposing them into three orthogonal components: Path, defining operational locality; Shape, defining geometric structure and underlying symmetry assumptions; and Weight, defining feature importance. We elevate this framework to a predictive theory grounded in two fundamental principles. First, we introduce the Principle of Structural Alignment, which posits that optimal generalization is achieved when the GWO's (P, S, W) configuration mirrors the data's intrinsic structure. Second, we show that this principle is a direct consequence of the Information Bottleneck (IB) principle. To formalize this, we define an Operational Complexity metric based on Kolmogorov complexity. However, we move beyond the simplistic view that lower complexity is always better. We argue that the nature of this complexity, whether it contributes to brute-force capacity or to adaptive regularization, is the true determinant of generalization. Our theory predicts that a GWO whose complexity is utilized to adaptively align with data structure will achieve a superior generalization bound. Canonical operations and their modern variants emerge as optimal solutions to the IB objective, and our experiments reveal that the quality, not just the quantity, of an operation's complexity governs its performance. The GWO theory thus provides a grammar for creating neural operations and a principled pathway from data properties to generalizable architecture design.
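To make the decomposition concrete, here is a minimal NumPy sketch, not code from the paper: the `gwo`, `conv_like`, and `attention_like` functions, the 1-D sequence setting, and the clipped-border handling are illustrative assumptions. It shows how a convolution-like and an attention-like operation can both be written as choices of Path, Shape, and Weight.

```python
import numpy as np


def gwo(x, path, shape_fn, weight_fn):
    """Hypothetical Generalized Windowed Operation on an (N, D) sequence.

    path(i)            -> indices gathered for output position i     (Path, P)
    shape_fn(i, idx)   -> 0/1 mask imposing geometric structure      (Shape, S)
    weight_fn(i, win)  -> per-element importance of the window       (Weight, W)
    """
    n = x.shape[0]
    out = np.zeros_like(x)
    for i in range(n):
        idx = path(i)                        # P: which positions participate
        window = x[idx]                      # gather the window
        mask = shape_fn(i, idx)              # S: geometric / symmetry structure
        w = weight_fn(i, window) * mask      # W: feature importance
        out[i] = (w[:, None] * window).sum(axis=0)  # weighted aggregation
    return out


def conv_like(x, taps):
    """Convolution-like instance: local Path, dense Shape, position-fixed Weight."""
    n, k = x.shape[0], len(taps)
    half = k // 2
    path = lambda i: np.clip(np.arange(i - half, i - half + k), 0, n - 1)
    shape_fn = lambda i, idx: np.ones(len(idx))      # full local window
    weight_fn = lambda i, window: taps               # same taps at every position
    return gwo(x, path, shape_fn, weight_fn)


def attention_like(x):
    """Attention-like instance: global Path, dense Shape, content-dependent Weight."""
    n, d = x.shape
    path = lambda i: np.arange(n)                    # every position is visible
    shape_fn = lambda i, idx: np.ones(len(idx))      # no geometric restriction

    def weight_fn(i, window):
        scores = window @ x[i] / np.sqrt(d)          # dot-product similarity
        scores -= scores.max()                       # numerical stability
        e = np.exp(scores)
        return e / e.sum()                           # softmax over the window

    return gwo(x, path, shape_fn, weight_fn)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 4))
    print(conv_like(x, taps=np.array([0.25, 0.5, 0.25])).shape)  # (8, 4)
    print(attention_like(x).shape)                               # (8, 4)
```

In this reading, the convolution-like instance fixes a small local Path and reuses the same Weight at every position, while the attention-like instance opens the Path to the whole sequence and lets the Weight depend on the content of the window; the Shape component is left as a trivial dense mask here, but could encode dilation, sparsity, or other symmetry assumptions.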

