Gradient-free Single-pass Model Beats nanoGPT on Shakespeare

A LessWrong post reports a striking result: a statistical model that simply counts character patterns beats a neural network trained by gradient descent, and it does so with zero trainable parameters. The model, EntropyBeam, makes a single pass through the training text, building frequency tables of character patterns. At prediction time it evaluates several context lengths simultaneously, scoring each by how confident its prediction is and how often that pattern appeared in training. The final prediction is a weighted blend across these context lengths. On Shakespeare, the results outperform nanoGPT, a small transformer model: EntropyBeam reaches a validation loss of 1.6 nats against nanoGPT's 2.1. More impressively, it uses 68,000 times less computation for training and 30 times less per prediction. Wall-clock time was 12 seconds versus 26. According to the post, the findings suggest that for smaller language tasks, simple statistical approaches can outcompete modern deep learning. It's a reminder that complexity isn't always the answer.
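
To make the idea concrete, here is a minimal Python sketch of a count-based character model that blends several context lengths. It is not EntropyBeam's actual code: the function names (build_tables, predict, average_loss), the maximum context length, and especially the weighting rule (confidence measured as one minus normalized entropy, multiplied by the log of the pattern count) are assumptions chosen for illustration, since the summary above does not give the exact formula.

```python
import math
from collections import Counter, defaultdict


def build_tables(text, max_order=4):
    """Single pass over the text: for each context length 0..max_order,
    count which character follows each context."""
    tables = [defaultdict(Counter) for _ in range(max_order + 1)]
    for i, ch in enumerate(text):
        for k in range(max_order + 1):
            if i - k < 0:
                break  # not enough history for this context length
            tables[k][text[i - k:i]][ch] += 1
    return tables


def predict(tables, context, vocab):
    """Blend next-character distributions from all context lengths.

    ASSUMPTION: the weight of each context length is
    (1 - normalized entropy) * log(1 + count). This is an illustrative
    choice, not the formula from the original post.
    """
    floor = 1e-3  # small uniform floor so no character gets probability 0
    blended = {c: floor / len(vocab) for c in vocab}
    total_weight = floor
    max_entropy = math.log(len(vocab)) if len(vocab) > 1 else 0.0

    for k, table in enumerate(tables):
        if k > len(context):
            break
        counts = table.get(context[len(context) - k:])
        if not counts:
            continue  # this pattern never appeared in training
        n = sum(counts.values())
        probs = {c: counts[c] / n for c in counts}
        entropy = -sum(p * math.log(p) for p in probs.values())
        confidence = 1.0 if max_entropy == 0.0 else max(0.0, 1.0 - entropy / max_entropy)
        weight = confidence * math.log1p(n)
        for c, p in probs.items():
            blended[c] += weight * p
        total_weight += weight

    return {c: v / total_weight for c, v in blended.items()}


def average_loss(tables, text, vocab):
    """Average negative log-likelihood per character, in nats.
    Assumes text is non-empty and uses only characters in vocab."""
    max_order = len(tables) - 1
    total = 0.0
    for i, ch in enumerate(text):
        context = text[max(0, i - max_order):i]
        total -= math.log(predict(tables, context, vocab)[ch])
    return total / len(text)
```

A typical use would be to call build_tables on the training text, take vocab as sorted(set(training_text)), and then call average_loss on held-out text that uses the same characters. The loss this toy version produces should not be expected to match the figures quoted above.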

Source: https://www.lesswrong.com/posts/rSiwKbisKmMhALpBk/gradien...
