Speeding Up JumpReLU SAE Inference with Custom Triton Kernels (2–14× on Real SAEs)

According to a LessWrong post by Daniel Tiourine, researchers working with Sparse Autoencoders (SAEs), a key tool for understanding how neural networks process information, have been stuck with an inefficiency: standard inference treats the SAE's activations as dense matrices, even though most of their values are zero. Tiourine demonstrates how to exploit this sparsity with custom Triton GPU kernels that skip the zero entries entirely during computation. The approach comes in two flavors: exact allocation, which determines memory needs dynamically, and fixed allocation, which pre-allocates a fixed amount of memory and trades minor waste for speed. In practice, the optimization delivers a 2- to 14-fold speedup on real SAE workloads, which could make mechanistic interpretability research practical at larger scales.
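To make the idea concrete, here is a minimal pure-Python sketch of skipping zero activations during the SAE decode step. It is not Tiourine's Triton code. Every name in it (`jumprelu`, `to_sparse_exact`, `to_sparse_fixed`, `decode_sparse`, `k_max`) is hypothetical, and the reading of "exact" versus "fixed" allocation is an assumption based only on the summary above.

```python
# Illustrative sketch only: plain Python lists stand in for GPU tensors.
# All function and parameter names are hypothetical, not from the post.

def jumprelu(pre_acts, thresholds):
    """JumpReLU: keep a pre-activation only if it exceeds its threshold."""
    return [z if z > t else 0.0 for z, t in zip(pre_acts, thresholds)]

def decode_dense(acts, w_dec, b_dec):
    """Baseline: visit every feature row, including rows scaled by zero."""
    out = list(b_dec)
    for i, a in enumerate(acts):
        for j, w in enumerate(w_dec[i]):
            out[j] += a * w
    return out

def to_sparse_exact(acts):
    """Assumed 'exact allocation': size the buffers to the true nonzero count."""
    idx = [i for i, a in enumerate(acts) if a != 0.0]
    vals = [acts[i] for i in idx]
    return idx, vals

def to_sparse_fixed(acts, k_max):
    """Assumed 'fixed allocation': pre-allocate k_max slots, pad unused ones."""
    idx = [-1] * k_max      # -1 marks an unused (wasted) slot
    vals = [0.0] * k_max
    n = 0
    for i, a in enumerate(acts):
        if a != 0.0:
            if n == k_max:
                raise ValueError("more active features than k_max slots")
            idx[n], vals[n] = i, a
            n += 1
    return idx, vals

def decode_sparse(idx, vals, w_dec, b_dec):
    """Touch only the decoder rows of active features; skip padding slots."""
    out = list(b_dec)
    for i, a in zip(idx, vals):
        if i < 0:
            continue
        for j, w in enumerate(w_dec[i]):
            out[j] += a * w
    return out

# Tiny made-up example: 4 features, 2 output dimensions.
pre_acts = [0.5, 2.0, -1.0, 3.0]
thresholds = [1.0, 1.0, 1.0, 1.0]
w_dec = [[1.0, 0.0], [0.5, 1.0], [2.0, 2.0], [1.0, -1.0]]
b_dec = [0.25, 0.0]

acts = jumprelu(pre_acts, thresholds)
dense_out = decode_dense(acts, w_dec, b_dec)
exact_out = decode_sparse(*to_sparse_exact(acts), w_dec, b_dec)
fixed_out = decode_sparse(*to_sparse_fixed(acts, k_max=3), w_dec, b_dec)
```

All three decode paths produce the same output; the sparse ones simply do less work. In this sketch the exact variant needs a counting pass before it can size its buffers, while the fixed variant skips that pass at the cost of padding slots, which is one plausible reading of the "minor waste for speed" trade-off.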

Source: https://www.lesswrong.com/posts/8gZspSs4WFtpfki9i/speedin...
