GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz

According to a post by researcher @fguzmanai, GateGPT is an FPGA-based transformer inference system that reaches 56,000 tokens per second at an 80 MHz clock. By optimizing key-value (KV) cache operations in specialized hardware, the work points to an alternative to GPU-based language model inference, potentially at lower power consumption.
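To put the two headline figures in proportion, the short sketch below divides the clock rate by the reported throughput to get the average number of clock cycles available per token. The two constants come from the post; the helper name `cycles_per_token` is hypothetical, and the calculation assumes the throughput figure is an average over a steady token stream, which the summary does not specify.

```python
def cycles_per_token(clock_hz: float, tokens_per_second: float) -> float:
    """Average number of clock cycles available per generated token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return clock_hz / tokens_per_second


# Figures reported in the post.
CLOCK_HZ = 80e6             # 80 MHz
TOKENS_PER_SECOND = 56_000  # 56k tokens per second

budget = cycles_per_token(CLOCK_HZ, TOKENS_PER_SECOND)
print(f"{budget:.0f} cycles per token")                          # 1429 cycles per token
print(f"{1e6 / TOKENS_PER_SECOND:.1f} microseconds per token")   # 17.9 microseconds per token
```

At roughly 1,400 cycles per token, the design would have to do a large amount of work in parallel each cycle, which is consistent with the post's emphasis on handling the KV cache in dedicated hardware.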

Source: https://twitter.com/fguzmanai/status/2065832668172845209
