What training a world-class LLM looks like (hint: it is messy!)
Every few months, a new state-of-the-art AI model is released. The technical reports are polished, the loss curves are smooth, and every decision seems obvious in hindsight. It all looks so clean.
But the reality is far more chaotic.
The team at Hugging Face just pulled back the curtain with their Smol Training Playbook, an incredibly detailed account of training their 3B parameter SmolLM3 model. Far from being a story of flawless execution, it’s a drama of debugging sessions, unexpected failures, and hard-won lessons.
I’ve dug into the article and pulled out the most interesting takeaways. There’s no carefully polished narrative here; this is an honest look at what it really takes to build a world-class LLM (a far cry from the £75 budget one I trained myself).
1. The Real Question Isn’t “How” to Train—It’s “If”
The playbook opens with a surprising piece of advice for a training guide: you probably don’t need to train your own model.
In a world filled with powerful open-source models like Qwen3, Gemma 3, and Llama 3, most use cases can be covered with clever prompting or fine-tuning. The authors argue that many projects fail not because of bad hyperparameters, but because they burned months of engineering time and millions in compute to build something that already existed, or, worse, that nobody needed.
Before you write a single line of code, their “Training Compass” forces you to answer the hard questions. It’s a reminder to check your ego at the door and start with what’s already available.
2. The Hidden Cost That Will Wreck Your Budget
What does the cost of training an LLM like this look like? You’re not just paying for the final training run.
The SmolLM3 team revealed a sobering statistic: ablations (experiments) and debugging consumed over 160,000 GPU-hours, which was more than 50% of the cost of the final 11-trillion-token training run.
This is the invisible iceberg of LLM development. The endless cycle of testing ideas, hunting down bugs, and rerunning experiments isn’t a small side task: it’s often the most expensive part of the entire project. If you’re not budgeting for it, you’re planning to fail.
3. War Story: The 1 Trillion Token Restart
At the heart of the playbook are the “war stories”—the moments where everything went wrong. The most dramatic? The team had to restart their training run after burning through 1 TRILLION tokens.
The culprit was a bug so subtle it was almost invisible. When using tensor parallelism to split the model across GPUs, they had accidentally used the same random seed for every parallel shard, so weights that should have been initialized with independent random values were drawing the same numbers. The model was still learning, but its performance was silently capped.
They only caught it because they were rigorously comparing its progress against a smaller, older model. Without that baseline, they might have trained the entire model to completion, only to realize it was a costly disappointment. It’s a lesson in how tiny implementation details can have huge consequences.
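To make that failure mode concrete, here is a minimal sketch, not the team’s actual code, of the fix: give each tensor-parallel rank its own seed offset so that sharded weights are initialized from independent random streams. The helper name and the one-process-per-shard assumption are mine, not the playbook’s.

```python
import torch
import torch.distributed as dist


def seed_tensor_parallel_shard(base_seed: int) -> None:
    """Give each tensor-parallel shard its own RNG stream.

    Hypothetical helper: with one shared seed, every shard of a split
    weight matrix is initialized with the same random numbers, which is
    exactly the silent bug described above.
    """
    # Assumes one process per tensor-parallel shard; real frameworks keep
    # a dedicated TP rank that differs from the global rank.
    tp_rank = dist.get_rank()
    torch.manual_seed(base_seed + tp_rank)  # offset the seed per shard
```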
4. Your Intuition Is a Trap. Trust Only the Data.
The playbook is a graveyard of “obvious” ideas that failed spectacularly in practice.
- The “Good Data” That Hurt Performance: The team thought training on arXiv papers—a trove of human knowledge—would make the model smarter. Instead, it made the small model worse because the academic style was too niche.
- Too Much of a Good Thing: Everyone knows code data helps with reasoning. But when the team increased the share of code in their data mix from 10% to 25%, the model’s performance on English benchmarks tanked, and they had to revert the change (the first sketch after this list shows what a mixture like this looks like in code).
- The “Vibe Check” Bug: Automated evaluations showed the model was following instructions perfectly. But when the team simply started chatting with it (a “vibe check”), they realized it was completely ignoring their system prompts. A bug was silently stripping the system message from every single training example; a sanity check like the second sketch below would have caught it. Remember: metrics don’t tell the whole story.
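To put numbers like that in context: a “data mix” is just the set of sampling weights over different corpora. Here is a minimal, hypothetical sketch of weighted source sampling; the source names and all weights except the 10% code share are illustrative assumptions, not the team’s actual recipe.

```python
import random

# Hypothetical mixture weights; only the 10% code share comes from the article.
MIXTURE_WEIGHTS = {
    "web_english": 0.70,
    "code": 0.10,
    "multilingual": 0.20,
}


def sample_source(rng: random.Random) -> str:
    """Pick which data source the next training document is drawn from."""
    sources = list(MIXTURE_WEIGHTS)
    weights = list(MIXTURE_WEIGHTS.values())
    return rng.choices(sources, weights=weights, k=1)[0]


rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])  # mostly 'web_english'
```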
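And the system-prompt bug is exactly the kind of thing a tiny preprocessing test catches early. The sketch below is a hedged example built on the Transformers apply_chat_template API; the checkpoint name and the marker string are illustrative choices, not what the SmolLM3 team actually ran.

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any tokenizer with a chat template will do.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

example = [
    {"role": "system", "content": "MARKER: always answer in French."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Render the conversation exactly as the training pipeline would see it.
rendered = tokenizer.apply_chat_template(example, tokenize=False)

# If preprocessing silently drops the system message, fail loudly now
# instead of discovering it months later in a "vibe check".
assert "MARKER" in rendered, "System prompt was stripped before training!"
```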
The Takeaway
Training a world-class LLM isn’t a clean, linear process. It’s a messy, iterative science of experimentation, relentless debugging, and resilience. The Smol Training Playbook reveals that success comes not from a perfect recipe, but from a disciplined process of testing assumptions, preparing for failure, and learning from the inevitable chaos.
It’s one of the most honest documents to come out of the AI space, and a must-read for anyone who wants to understand what it really takes.