A comprehensive technical blog and white-box interactive laboratory. We crack open the black box of LLMs to show you exactly how Attention, RoPE, and MoE algorithms work under the hood.
From rote memorization to genuine algorithmic generalization. We train a 4.7M parameter miniature Transformer to achieve 100% accuracy on all 2-digit multiplication pairs using Chain-of-Thought (CoT) prompting and reversed-addition datasets. Discover why data engineering trumps architecture.
What happens when sequences get long? We replace standard learned positional embeddings with RoPE, accelerating convergence by 300% and unlocking sequence length extrapolation capabilities. We dive deep into the Fourier shift theorem that makes RoPE mathematically elegant.
Scaling up from 2-digit to 4-digit multiplication without exponential data growth. We employ curriculum learning strategies (SFT) to teach the model progressive difficulty. Learn how specialized datasets (zero-padding, carry-overs) drive 99.9%+ accuracy on 4-digit mathematics.
Why is token generation so slow without caching? We implement KV Cache (Prefill + Decode phases) from scratch to avoid O(n²) redundant re-computation. We benchmark the exact acceleration metrics on long mathematical reasoning sequences.
We hand-write block-wise online softmax (FlashAttention) to understand GPU memory hierarchy and I/O optimization. We also experiment with Kimi-style Attention Residuals (AttnRes) and analyze feature collapse phenomena when training is pushed to the limit.
An experimental teardown of DeepSeek-V3's Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE). We observe how low-rank KV compression acts as a natural regularizer, pushing our 4.7M model to an unprecedented 100% accuracy on long sequences. Plus: error-book fine-tuning efficiency.
Step inside the forward pass. Using Three.js, we visualize the exact flow of matrices through Embeddings, Attention blocks, and FFNs. Watch the numbers transform as we trace how "7*8=" turns into "56" in real-time 3D.
Before Transformers, there were MLPs. We build a pure-Numpy neural network from scratch, implement backpropagation entirely by hand, and teach it the multiplication table. A fundamental lesson in calculus and gradient descent.