Do low-rank or block-diagonal matrices come up in LLMs often? What about banded or block-tridiagonal ones? Intuitively, banded matrices seem like they ought to be good at encoding things about the world… everything is connected, but not randomly so.
Yep!
Think of LoRA for network fine-tuning. Monarch (linked above) leans heavily on block diagonality. These ideas are also part of what makes FlashAttention flash.
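For concreteness, here's a minimal sketch of the low-rank trick behind LoRA, assuming PyTorch (the dimensions, rank, and init here are illustrative, not the paper's exact recipe): the frozen d×d weight gets a trainable rank-r correction B @ A, so you only train O(d·r) extra parameters instead of O(d²).

```python
import torch

d, r = 1024, 8                # hidden size, adapter rank (illustrative; r << d)
W = torch.randn(d, d)         # frozen pretrained weight
A = torch.randn(r, d) * 0.01  # trainable r x d factor
B = torch.zeros(d, r)         # trainable d x r factor (zero init: adapter starts as a no-op)

x = torch.randn(d)
y = W @ x + B @ (A @ x)       # adapted forward pass; extra cost is O(d*r), not O(d^2)
```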
I haven't seen banded matrices as much, though (with weight sharing) they're just convolutions. One nice feature of block diagonality is that you can express it as batched matrix multiplication, reusing all the existing matmul kernels.
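To make the banded-matrix-as-convolution point concrete, here's a small sanity check (again a PyTorch sketch with made-up sizes): a banded matrix whose diagonals each carry one shared weight computes exactly a 1D convolution with that kernel.

```python
import torch
import torch.nn.functional as F

n, k = 16, 3             # sequence length, band width (= kernel size)
w = torch.randn(k)       # one shared weight per diagonal

# Build the banded matrix: weight w[i] fills the diagonal at offset i - k//2.
M = torch.zeros(n, n)
for i, off in enumerate(range(-(k // 2), k // 2 + 1)):
    M += torch.diag(w[i].expand(n - abs(off)), off)

x = torch.randn(n)
y_band = M @ x

# torch's conv1d is cross-correlation, which matches the diagonal layout directly.
y_conv = F.conv1d(x.view(1, 1, n), w.view(1, 1, k), padding=k // 2).flatten()
assert torch.allclose(y_band, y_conv, atol=1e-5)
```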
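And here's the batched-matmul view of a block-diagonal multiply, as a quick sketch (shapes illustrative): the same product computed two ways, once through the materialized block-diagonal matrix and once as b independent k×k matvecs in a single bmm call.

```python
import torch

b, k = 4, 64                   # number of blocks, block size
blocks = torch.randn(b, k, k)  # the b diagonal blocks
x = torch.randn(b * k)

# Dense view: materialize the full (b*k) x (b*k) block-diagonal matrix.
M = torch.block_diag(*blocks)
y_dense = M @ x

# Batched view: b independent k x k matvecs, one torch.bmm call.
y_batched = torch.bmm(blocks, x.view(b, k, 1)).flatten()
assert torch.allclose(y_dense, y_batched, atol=1e-5)
```

The batched view never materializes the off-block zeros, which is why it can just reuse the existing batched-matmul kernels at full efficiency.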