Demystifying Transformers & Large Language Models

A comprehensive technical blog and white-box interactive laboratory. We crack open the black box of LLMs to show you exactly how Attention, RoPE, and MoE algorithms work under the hood.


Latest Masterclass Tutorials

Chapter 1: Teaching a Transformer to Multiply (Chain of Thought)

From rote memorization to genuine algorithmic generalization. We train a miniature 4.7M-parameter Transformer to reach 100% accuracy on all 2-digit multiplication pairs using Chain-of-Thought (CoT) prompting and reversed-addition datasets. Discover why data engineering trumps architecture.

Chain-of-Thought Data Engineering Generalization
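
To make the idea of a CoT sample with reversed addition concrete, here is a minimal sketch. The string format, the helper names `reverse_digits`, `cot_sample`, and `decode_answer`, and the choice to reverse every intermediate number are assumptions made for illustration; they are not the dataset format used in the chapter. Writing digits least-significant first lets the model emit each digit of a sum as soon as its carry is known.

```python
def reverse_digits(n: int) -> str:
    """Write a non-negative integer least-significant digit first."""
    return str(n)[::-1]


def cot_sample(a: int, b: int) -> str:
    """Build one hypothetical training string for a * b (b has two digits).

    The product is split into partial products for the ones and tens digit
    of b; the partial products and the final sum are written reversed.
    """
    ones, tens = b % 10, b // 10
    p_ones = a * ones          # a times the ones digit
    p_tens = a * tens * 10     # a times the tens digit, shifted one place
    total = p_ones + p_tens
    return (
        f"{a}*{b}={a}*{ones}+{a}*{tens * 10}="
        f"{reverse_digits(p_ones)}+{reverse_digits(p_tens)}="
        f"{reverse_digits(total)}"
    )


def decode_answer(sample: str) -> int:
    """Recover the integer answer from the last (reversed) field."""
    return int(sample.split("=")[-1][::-1])


print(cot_sample(12, 34))  # 12*34=12*4+12*30=84+063=804
```

Here `84`, `063`, and `804` are 48, 360, and 408 written in reverse, so `decode_answer` returns 408.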

Chapter 2: Rotary Position Embedding (RoPE) and Extrapolation

What happens when sequences get long? We replace standard learned positional embeddings with RoPE, accelerating convergence by 300% and enabling extrapolation to longer sequences. We dive deep into the Fourier shift theorem that makes RoPE mathematically elegant.

RoPE Positional Encoding Math
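
The property that makes RoPE attractive can be checked in a few lines: after rotation, the query-key dot product depends only on the relative offset between positions. The sketch below is a plain NumPy version under common assumptions (interleaved feature pairs, frequency base 10000); the function name `rope` and these defaults are illustrative and may differ from the chapter's code.

```python
import numpy as np


def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of vector x by angles pos * theta_i."""
    d = x.shape[-1]                                # must be even
    theta = base ** (-np.arange(0, d, 2) / d)      # one frequency per pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin    # 2-D rotation, first coord
    out[..., 1::2] = x_even * sin + x_odd * cos    # 2-D rotation, second coord
    return out


rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (3) at very different absolute positions.
near = rope(q, 5) @ rope(k, 2)
far = rope(q, 103) @ rope(k, 100)
print(np.isclose(near, far))
```

Because each pair is only rotated, vector norms are unchanged, and shifting both positions by the same amount leaves the attention score untouched.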

Chapter 3: Supervised Fine-Tuning Curriculum & Self-Evolution

Scaling up from 2-digit to 4-digit multiplication without exponential data growth. We use a supervised fine-tuning (SFT) curriculum to teach the model in stages of increasing difficulty. Learn how specialized datasets (zero-padding, carry-overs) drive 99.9%+ accuracy on 4-digit multiplication.

Curriculum Learning SFT Data Curation

Chapter 4: KV Cache Optimization for Long Sequence Inference

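Why is token generation so slow without caching? Without a cache, every decode step recomputes keys and values for the entire prefix, which is redundant O(n²) attention work per new token. We implement a KV Cache (Prefill + Decode phases) from scratch so each step reuses stored keys and values, then benchmark the speedup on long mathematical reasoning sequences.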

KV Cache Inference Optimization Performance
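
The prefill and decode phases can be sketched for a single attention head in NumPy. This is a simplified illustration, not the chapter's implementation: the names `KVCache`, `decode_step`, and `causal_attention` are hypothetical, and there is no batching, multi-head split, or positional encoding.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def causal_attention(x, Wq, Wk, Wv):
    """Reference path: project every token and apply a causal mask."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ v


class KVCache:
    """Stores the keys and values of all tokens seen so far."""

    def __init__(self):
        self.k = None
        self.v = None

    def append(self, k_new, v_new):
        self.k = k_new if self.k is None else np.vstack([self.k, k_new])
        self.v = v_new if self.v is None else np.vstack([self.v, v_new])


def decode_step(x_t, cache, Wq, Wk, Wv):
    """Attend from one new token (shape (1, d)) to everything in the cache."""
    cache.append(x_t @ Wk, x_t @ Wv)       # only the new token is projected
    q = x_t @ Wq
    scores = q @ cache.k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache.v


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

cache = KVCache()
cache.append(x[:3] @ Wk, x[:3] @ Wv)       # prefill: whole prompt in one pass
for t in range(3, 6):                      # decode: one token at a time
    out = decode_step(x[t:t + 1], cache, Wq, Wk, Wv)
    assert np.allclose(out[0], causal_attention(x, Wq, Wk, Wv)[t])
```

The loop confirms that the cached path reproduces the full recomputation row by row while projecting only one new token per step.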

Chapter 5: FlashAttention and Attention Residuals

We hand-write block-wise online softmax (FlashAttention) to understand GPU memory hierarchy and I/O optimization. We also experiment with Kimi-style Attention Residuals (AttnRes) and analyze feature collapse phenomena when training is pushed to the limit.

FlashAttention CUDA I/O AttnRes
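
The core of block-wise online softmax fits in a short NumPy function. The sketch below processes keys and values one block at a time while keeping a running row maximum and row sum, so the full score matrix is never materialized. It is a CPU illustration of the arithmetic only (non-causal, single head, hypothetical name `blockwise_attention`); a real FlashAttention kernel also tiles the queries and manages on-chip memory explicitly.

```python
import numpy as np


def blockwise_attention(q, k, v, block=4):
    """Softmax attention computed over key/value blocks with online rescaling."""
    n, d = q.shape
    out = np.zeros((n, v.shape[-1]))       # unnormalized output accumulator
    row_max = np.full((n, 1), -np.inf)     # running max of scores per query
    row_sum = np.zeros((n, 1))             # running softmax denominator

    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)                          # scores for this block
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        scale = np.exp(row_max - new_max)                  # rescale old statistics
        p = np.exp(s - new_max)
        row_sum = row_sum * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ vb
        row_max = new_max

    return out / row_sum


def naive_attention(q, k, v):
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 8)) for _ in range(3))
print(np.allclose(blockwise_attention(q, k, v), naive_attention(q, k, v)))
```

On the first block the old statistics are scaled by `exp(-inf) = 0`, which is why the accumulators can start at zero without a special case.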

Chapter 6: DeepSeek-V3/V4 Architecture (MLA & MoE)

An experimental teardown of DeepSeek-V3's Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE). We observe how low-rank KV compression acts as a natural regularizer, pushing our 4.7M-parameter model to an unprecedented 100% accuracy on long sequences. Plus: error-book fine-tuning efficiency.

DeepSeek-V3 MLA MoE
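
For readers new to MoE, the routing step can be sketched generically: a gate scores every expert for each token, only the top-k experts run, and their outputs are mixed with normalized gate weights. This is a textbook-style illustration with linear experts and a softmax over the selected logits. It is an assumption for teaching purposes, not DeepSeek-V3's actual router, and the names `moe_forward` and `gate_W` are hypothetical.

```python
import numpy as np


def moe_forward(x, gate_W, experts, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    x:       (tokens, d) input activations
    gate_W:  (d, n_experts) gating weights
    experts: list of (d, d) matrices standing in for expert FFNs
    """
    logits = x @ gate_W                                  # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]        # indices of chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        w = np.exp(chosen - chosen.max())
        w /= w.sum()                                     # weights sum to 1 per token
        for weight, e in zip(w, top[t]):
            out[t] += weight * (x[t] @ experts[e])       # only top_k experts run
    return out, top


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
gate_W = rng.normal(size=(8, 4))
experts = [rng.normal(size=(8, 8)) for _ in range(4)]
y, chosen = moe_forward(x, gate_W, experts)
print(y.shape, chosen.shape)  # (5, 8) (5, 2)
```

Per-token compute scales with `top_k` rather than with the total number of experts, which is the main appeal of the design.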

Chapter 7: 3D Matrix Flow Visualization

Step inside the forward pass. Using Three.js, we visualize the exact flow of matrices through Embeddings, Attention blocks, and FFNs. Watch the numbers transform as we trace how "7*8=" turns into "56" in real-time 3D.

Three.js 3D Visualization Forward Pass

Chapter 8: Building a Multi-Layer Perceptron from Scratch

Before Transformers, there were MLPs. We build a pure-NumPy neural network from scratch, implement backpropagation entirely by hand, and teach it the multiplication table. A fundamental lesson in calculus and gradient descent.

NumPy Backpropagation Neural Networks Fundamentals
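
As a taste of what hand-written backpropagation looks like, here is a minimal one-hidden-layer sketch in NumPy trained on the 9x9 multiplication table. The architecture (16 tanh units), the input and target scaling, the learning rate, and all function names are assumptions chosen for brevity; they are not the chapter's settings.

```python
import numpy as np


def init_params(n_in=2, n_hidden=16, n_out=1, seed=0):
    rng = np.random.default_rng(seed)
    return [rng.normal(0, 0.5, (n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(0, 0.5, (n_hidden, n_out)), np.zeros(n_out)]


def loss_and_grads(params, x, t):
    """Forward pass, mean squared error, and gradients via the chain rule."""
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)                 # hidden activations
    y = h @ W2 + b2                          # linear output
    diff = y - t
    loss = 0.5 * np.mean(np.sum(diff ** 2, axis=1))

    dy = diff / x.shape[0]                   # dLoss/dy
    dW2 = h.T @ dy
    db2 = dy.sum(axis=0)
    dz = (dy @ W2.T) * (1.0 - h ** 2)        # back through tanh
    dW1 = x.T @ dz
    db1 = dz.sum(axis=0)
    return loss, [dW1, db1, dW2, db2]


def multiplication_table():
    """Inputs (a, b) in 1..9 scaled to (0, 1]; targets a*b scaled by 81."""
    a, b = np.meshgrid(np.arange(1, 10), np.arange(1, 10))
    x = np.stack([a.ravel(), b.ravel()], axis=1) / 9.0
    t = (a.ravel() * b.ravel()).reshape(-1, 1) / 81.0
    return x, t


def train(params, x, t, lr=0.05, steps=200):
    """Plain full-batch gradient descent."""
    for _ in range(steps):
        _, grads = loss_and_grads(params, x, t)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params


x, t = multiplication_table()
params = init_params()
before, _ = loss_and_grads(params, x, t)
after, _ = loss_and_grads(train(params, x, t), x, t)
print(before > after)
```

A useful habit when writing gradients by hand is to compare one entry against a central finite difference before trusting the training loop.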