A Blogpost series about Model Architectures Part 1: What happened to BERT and T5? Thoughts on Transformer Encoders, PrefixLM and Denoising objectives
A quick primer (Skip connection to next section if you feel confident)

There are mainly three overarching paradigms of model architectures from the past couple of years: encoder-only models (e.g., BERT), encoder-decoder models (e.g., T5), and decoder-only models (e.g., the GPT series).

A variant of the encoder-decoder setup is the Prefix Language Model, or PrefixLM, architecture, which does almost the same thing minus the cross-attention (and some other small details, like sharing weights between encoder and decoder and not having an encoder bottleneck); a sketch of the attention pattern follows below.

An advantage of this autoregressive "shift to the back" style of denoising is that the model not only learns longer-range dependencies but also implicitly benefits from non-explicit bidirectional attention (since you have already seen the future context in order to fill in the blank); a worked example of that input/target construction also follows below.
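To make the PrefixLM point concrete, here is a minimal NumPy sketch (not from the original post; the function name and shapes are illustrative assumptions) of how a PrefixLM attention mask differs from a plain decoder-only causal mask: positions inside the prefix attend to each other bidirectionally, while target positions remain causal.

```python
import numpy as np

def prefix_lm_mask(prefix_len: int, total_len: int) -> np.ndarray:
    """Boolean attention mask: entry [i, j] is True if query i may attend to key j.

    Illustrative sketch: prefix (input) tokens attend to each other bidirectionally;
    the remaining (target) tokens attend causally, as in a decoder-only model.
    """
    causal = np.tril(np.ones((total_len, total_len), dtype=bool))  # standard causal mask
    bidirectional_prefix = np.zeros((total_len, total_len), dtype=bool)
    bidirectional_prefix[:prefix_len, :prefix_len] = True          # full attention inside the prefix
    return causal | bidirectional_prefix

# A length-6 sequence with a 3-token prefix: rows are queries, columns are keys.
print(prefix_lm_mask(prefix_len=3, total_len=6).astype(int))
# [[1 1 1 0 0 0]
#  [1 1 1 0 0 0]
#  [1 1 1 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 0]
#  [1 1 1 1 1 1]]
```

Setting `prefix_len=0` recovers the ordinary decoder-only causal mask, which is one way to see PrefixLM as a small modification of a decoder rather than a separate architecture.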
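And here is a toy illustration, again an assumption rather than the post's own code, of the "shift to the back" denoising setup: corrupted spans are replaced with sentinels in the input, and the model predicts the sentinels plus the missing spans autoregressively at the end. Because every uncorrupted token, including those after a blank, sits in the input, filling a blank effectively conditions on "future" context, which is the implicit bidirectionality described above. The sentinel naming loosely follows the T5-style convention but is purely illustrative.

```python
# Hypothetical token sequence and mask positions chosen for illustration.
tokens = ["Thank", "you", "for", "inviting", "me", "to", "your", "party", "last", "week"]
spans_to_corrupt = [(3, 5), (8, 10)]  # half-open [start, end) index ranges to mask out

def span_corrupt(tokens, spans):
    """Replace each corrupted span with a sentinel in the input and move the
    span's contents to the back, forming an autoregressive denoising target."""
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])   # keep the uncorrupted text
        inputs.append(sentinel)               # stand-in for the removed span
        targets.append(sentinel)              # target echoes the sentinel...
        targets.extend(tokens[start:end])     # ...followed by the original span
        cursor = end
    inputs.extend(tokens[cursor:])            # any trailing uncorrupted text
    return inputs, targets

inputs, targets = span_corrupt(tokens, spans_to_corrupt)
print(" ".join(inputs))   # Thank you for <extra_id_0> to your party <extra_id_1>
print(" ".join(targets))  # <extra_id_0> inviting me <extra_id_1> last week
```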