Per your point 4, some of the current, much-hyped work is pushing hard in this direction [1, 2, 3]. The basic idea is to think of attention as a way of implementing an associative memory. Variants like SDPA or gated linear attention can then be derived as methods for optimizing this memory online so that a particular query returns a particular value. Different attention variants correspond to different choices of how the memory produces a value in response to a query, and of how we measure how well the produced value matches the desired one.
Some of the attention-like ops proposed in this new work are most simply described as implementing the associative memory with a hypernetwork that maps keys to values, with weights that are optimized at test time to minimize value-retrieval error. As you suggest, designing these hypernetworks to permit efficient implementations is tricky.
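To make this concrete, here's a minimal numpy sketch of the simplest version of the idea: a linear memory whose weights take one gradient step per token on the squared value-retrieval error (roughly the "delta rule" view). The function name, learning rate, and decay term are just illustrative; the actual papers use fancier memory parameterizations and chunked/parallel implementations.

```python
import numpy as np

# Minimal sketch (not from the cited papers): a linear associative memory
# M maps a key k to a value M @ k. At each step we take one gradient step on
# the squared retrieval error ||M @ k_t - v_t||^2, i.e. we optimize the memory
# online so that this key returns this value. Linear/gated attention variants
# differ in the loss, the memory parameterization, and any decay/gating on M.

def test_time_memory(keys, values, queries, lr=0.5, decay=1.0):
    """keys, values, queries: arrays of shape (T, d). Returns (T, d) outputs."""
    d = keys.shape[1]
    M = np.zeros((d, d))              # memory "weights", optimized at test time
    outputs = []
    for k, v, q in zip(keys, values, queries):
        pred = M @ k                           # what the memory currently returns for k
        grad = np.outer(pred - v, k)           # gradient of 0.5 * ||M k - v||^2 w.r.t. M
        M = decay * M - lr * grad              # one online gradient step (+ optional decay/gate)
        outputs.append(M @ q)                  # read the updated memory with the query
    return np.stack(outputs)

# Toy usage: with lr=1 and orthonormal keys, each stored value is retrieved exactly.
T, d = 4, 8
rng = np.random.default_rng(0)
K = np.linalg.qr(rng.normal(size=(d, d)))[0][:T]   # T orthonormal keys
V = rng.normal(size=(T, d))
out = test_time_memory(K, V, K, lr=1.0)
print(np.allclose(out, V))                          # True: queries recover stored values
```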
It's a more constrained interpretation of attention than you're advocating for, since it follows the "attention as associative memory" perspective, but the general idea of test-time optimization could be applied to other mechanisms for letting information interact non-linearly across arbitrary nodes in the compute graph.
[1] https://arxiv.org/abs/2501.00663
Yes, you can get good compression of a long sequence of "base" text tokens into a shorter sequence of "meta" text tokens, where each meta token represents the information from multiple base tokens. But grouping a fixed number of base tokens into each meta token isn't ideal, since that won't align neatly with sensible semantic boundaries, like words, phrases, sentences, etc. So, the trick is how to decide which base tokens should be grouped into each meta token...
This sort of "dynamic chunking" of low-level information, perhaps down to the level of raw bytes, into shorter sequences of meta tokens for input to some big sequence-processing model is an active area of research. E.g., one neat paper exploring this direction is "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling" [1], from one of the main guys behind Mamba and other major advances in state-space models.
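To give a flavor of what dynamic chunking means mechanically (a loose sketch, not the exact mechanism from that paper): a small router scores each base position as a potential chunk boundary, and the base-token vectors between boundaries get pooled into one meta-token vector. The function and names below are made up for illustration; the real thing makes the boundary decisions and pooling differentiable so they can be learned end-to-end.

```python
import numpy as np

# Loose illustration of dynamic chunking: boundary scores decide where chunks
# start, and each chunk of base-token vectors is pooled into one meta token.
# Hard thresholding is used here for clarity; a trained model would make this
# (approximately) differentiable.

def dynamic_chunk(base_vecs, boundary_scores, threshold=0.5):
    """base_vecs: (T, d) base-token embeddings (e.g. bytes after a small encoder).
    boundary_scores: (T,) score in [0, 1] that position t starts a new chunk.
    Returns (num_chunks, d) meta-token vectors (mean-pooled chunks)."""
    chunks, current = [], [base_vecs[0]]
    for t in range(1, len(base_vecs)):
        if boundary_scores[t] > threshold:        # start a new chunk here
            chunks.append(np.mean(current, axis=0))
            current = []
        current.append(base_vecs[t])
    chunks.append(np.mean(current, axis=0))
    return np.stack(chunks)

# Toy usage: 10 base tokens with predicted boundaries at positions 3 and 7
# (e.g. roughly word/phrase edges), giving 3 meta tokens.
T, d = 10, 4
rng = np.random.default_rng(0)
base = rng.normal(size=(T, d))
scores = np.zeros(T)
scores[[3, 7]] = 1.0
meta = dynamic_chunk(base, scores)
print(meta.shape)  # (3, 4)
```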
The trick is that the vision tokens are continuous-valued vectors, while the text tokens are elements of a small discrete set (which are converted into continuous-valued vectors by a lookup table). So, a vision token can convey significantly more bits than a text token. This allows them to pack the content of multiple text tokens into a single vision token.
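A quick back-of-the-envelope comparison makes the gap obvious (the vocabulary size, vector dimension, and precision below are assumed for illustration, not taken from any particular model):

```python
import math

# Rough capacity comparison: a discrete text token drawn from a vocabulary of
# size V carries at most log2(V) bits, while a continuous vision token is a
# d-dim vector whose nominal capacity is bounded by its bit representation.
vocab_size = 32_000                 # assumed text vocabulary size
d, bits_per_dim = 1024, 16          # assumed vision-token dim and float16 storage

text_bits = math.log2(vocab_size)   # ~15 bits per text token
vision_bits = d * bits_per_dim      # nominal upper bound per vision token
print(f"text token:   <= {text_bits:.1f} bits")
print(f"vision token: <= {vision_bits} bits (nominal bound)")
```

The usable capacity of a vision token is of course far below that nominal bound (the encoder and decoder can only exploit so much of it), but there's still plenty of headroom to carry several text tokens' worth of content in one vector.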