I’ve come to think of varlen primarily as the most efficient FlashAttention variant for training (it isn’t used when generating tokens), because it handles our technique of “processing a batch of examples” more efficiently: it treats them as one long concatenated sequence, rather than adding a separate “batch dimension” to the input tensors, which copes less naturally with varying sequence lengths.
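A minimal sketch of the bookkeeping behind this: varlen-style kernels locate each example inside the packed sequence via a tensor of cumulative offsets (`cu_seqlens` in the flash-attn API). The sequence lengths below are made up for illustration.

```python
# Build the cumulative-offset list that varlen-style kernels use to
# find each example inside one long concatenated sequence.
# Pure-Python sketch with hypothetical lengths; no batch dimension,
# no padding tokens.

def build_cu_seqlens(seq_lens):
    """Prefix sums: cu_seqlens[i] is where example i starts in the
    packed sequence; the last entry is the total token count."""
    cu = [0]
    for n in seq_lens:
        cu.append(cu[-1] + n)
    return cu

lens = [3, 5, 2]                    # three examples of varying length
cu_seqlens = build_cu_seqlens(lens)
print(cu_seqlens)                   # [0, 3, 8, 10]

# Example i occupies tokens cu_seqlens[i]:cu_seqlens[i+1] of the packed
# tensor, so attention can be masked per-example without any padding.
```

Compare this with the batched layout, which would allocate a `3 × 5` tensor and waste five padded positions on the shorter examples.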
Recent models like DeepSeek-V3 and Moonshot’s Kimi-K2, built using Multihead Latent Attention (MLA), have shown that constraining the input spaces of attention heads can be both effective and efficient. They project the input token vector (size 7,168) down to just 512 dimensions for keys and values, and to 1,536 for queries. Despite this aggressive compression, performance holds up well enough to support these frontier-scale models.
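As rough shape arithmetic for those projections (a simplified sketch using the sizes quoted above; it ignores MLA’s decoupled RoPE dimensions and the per-head up-projections):

```python
# Shape/parameter arithmetic for the MLA down-projections described
# above, DeepSeek-V3-style sizes. Simplified: real MLA also carries
# small decoupled-RoPE dimensions and per-head up-projection matrices.

d_model = 7168   # input token vector
d_kv    = 512    # shared latent for keys AND values
d_q     = 1536   # query latent

kv_down_params = d_model * d_kv   # one 7168 x 512 matrix
q_down_params  = d_model * d_q    # one 7168 x 1536 matrix

print(kv_down_params)   # 3670016
print(q_down_params)    # 11010048
print(d_model / d_kv)   # 14.0 -- the KV latent is 14x narrower
```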
Multihead Latent Attention (MLA), introduced by DeepSeek in their V2 model, is an alternative to standard attention (and other variants such as MQA and GQA) which dramatically reduces memory bandwidth requirements for the attention calculations.
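To make the bandwidth claim concrete, here is a back-of-the-envelope per-token, per-layer KV-cache comparison. The config below is DeepSeek-V3-like (128 heads of dimension 128, a 512-dim latent plus a 64-dim shared RoPE key); treat the exact numbers as illustrative assumptions.

```python
# Per-token, per-layer cache size: standard multi-head attention
# caches a full K and V vector per head, while MLA caches one shared
# latent plus a small RoPE key. Assumed DeepSeek-V3-like sizes.

n_heads, head_dim = 128, 128
mha_cache = 2 * n_heads * head_dim   # K and V for every head
mla_cache = 512 + 64                 # shared latent + decoupled RoPE key

print(mha_cache)               # 32768 elements per token
print(mla_cache)               # 576 elements per token
print(mha_cache / mla_cache)   # roughly 57x less cache to read per step
```

Since decoding is typically bound by reading the KV cache from memory, shrinking the cache by this factor is what cuts the bandwidth requirement.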
What had me most excited about the merged matrix perspective (perhaps overly so) was that the patterns and messages live in model space, the same space the vocabulary embeddings occupy.