NanoTransformer White-Box Lab

🎯 Free Inference Lab

Edit the input below to test the model's next-token prediction probability in real time.

🧊 3D Flow

1st Digit Prediction

2nd Digit Prediction (based on 1st)
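
The two prediction panels suggest two autoregressive steps: the model scores every token after `7*8=`, the most likely token is appended, and the extended sequence is scored again. The loop below is a minimal sketch of that idea. `toy_logits` is a hypothetical stand-in that simply looks up the right answer; it is not the trained network. The logit value 5.0 and the token index order are arbitrary assumptions.

```python
import math

# The 14 tokens from the configuration table; the index order is an assumption.
VOCAB = ["<pad>"] + [str(d) for d in range(10)] + ["*", "=", "\n"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(context):
    """Hypothetical stand-in for the model: favours the correct next character.

    Expects a context of the form 'a*b=' followed by any digits already written.
    """
    a, b = int(context[0]), int(context[2])
    answer = str(a * b) + "\n"      # assumed answer format, newline-terminated
    already_written = context[4:]   # characters produced after '='
    target = answer[len(already_written)]
    return [5.0 if tok == target else 0.0 for tok in VOCAB]


def greedy_decode(prompt, steps):
    """Append the most probable token `steps` times, recording each choice."""
    context, trace = prompt, []
    for _ in range(steps):
        probs = softmax(toy_logits(context))
        best = max(range(len(VOCAB)), key=lambda i: probs[i])
        trace.append((VOCAB[best], probs[best]))
        context += VOCAB[best]
    return context, trace


result, trace = greedy_decode("7*8=", steps=2)
for step, (token, prob) in enumerate(trace, start=1):
    print(f"digit {step}: {token!r} with probability {prob:.3f}")
```

The second call sees `7*8=5` rather than `7*8=`, which is what "based on 1st" means: the first digit becomes part of the input for the second prediction.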

📖 Project Overview

In 2025, Large Language Models like GPT-4, Claude, Gemini, and DeepSeek are reshaping every industry. They write code, compose poetry, pass bar exams, and even reason through complex math problems. Behind all of them lies a single, revolutionary architecture: the Transformer.

But here's the problem — for most people, a Transformer is a black box. You've seen the diagrams, read the formulas, maybe even watched a dozen YouTube tutorials. Yet when you try to answer the simplest question — "What exactly happens inside the model when it sees 7*8= and outputs 56?" — you draw a blank. The matrices are too large, the layers too deep, the numbers too abstract.

This project fixes that.

We took the same architecture that powers GPT (a decoder-only Transformer with multi-head self-attention, layer normalization, feed-forward networks, causal masking, and a softmax output) and compressed it into a model with only 18,304 parameters. For comparison, GPT-4 is rumored to have about 1.8 trillion; OpenAI has not published an official figure. We gave it the simplest possible task: memorize the multiplication table (0×0 through 9×9).
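
To make causal masking concrete, here is a minimal single-head sketch in plain Python. The random `q`, `k`, and `v` vectors are placeholders, not the model's learned projections, and the function name `causal_attention` is made up for this illustration. Only the head width of 8 comes from the configuration; the real model runs four such heads per layer.

```python
import math
import random


def softmax(scores):
    """Normalize scores into attention weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors of length d_head. Position t may only
    attend to positions 0..t, so future tokens get weight exactly 0.
    """
    seq_len, d_head = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d_head)
    outputs, weights = [], []
    for t in range(seq_len):
        # Score only the visible prefix; skipping the rest is the causal mask.
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s]))
                  for s in range(t + 1)]
        w = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [sum(w[s] * v[s][i] for s in range(t + 1)) for i in range(d_head)]
        outputs.append(out)
        weights.append(w + [0.0] * (seq_len - t - 1))  # pad masked slots with 0
    return outputs, weights


random.seed(0)
T, D_HEAD = 4, 8  # four tokens such as 7 * 8 =, with d_head = 8


def random_vectors():
    return [[random.gauss(0.0, 1.0) for _ in range(D_HEAD)] for _ in range(T)]


q, k, v = random_vectors(), random_vectors(), random_vectors()
out, attn = causal_attention(q, k, v)
for row in attn:
    print(" ".join(f"{w:.2f}" for w in row))
```

The printed matrix is lower-triangular: the first token can only look at itself, while the `=` position can draw on all four tokens.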

Then we cracked it wide open. Every single matrix. Every attention weight. Every intermediate calculation. Nothing hidden, nothing omitted. You're about to witness, number by number, how a neural network turns 7*8= into 56.
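
Before any matrix is involved, `7*8=` has to become a sequence of integers. The sketch below assumes character-level tokenization over the 14 tokens listed in the configuration table, with ids assigned in the order listed, and assumes each fact is stored as `a*b=c` followed by a newline. The source does not confirm either detail, and `encode`/`decode` are hypothetical helper names.

```python
# Assumed id order: <pad>=0, digits '0'..'9' = 1..10, '*'=11, '='=12, '\n'=13.
VOCAB = ["<pad>"] + [str(d) for d in range(10)] + ["*", "=", "\n"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def encode(text):
    """Map each character to its token id (one character per token)."""
    return [TOKEN_TO_ID[ch] for ch in text]


def decode(ids):
    """Inverse of encode for sequences that contain no <pad> token."""
    return "".join(VOCAB[i] for i in ids)


# All 100 facts from 0*0 to 9*9, in the assumed training format.
facts = [f"{a}*{b}={a * b}\n" for a in range(10) for b in range(10)]

prompt_ids = encode("7*8=")
target_ids = encode("56")
print(prompt_ids, "->", target_ids)
print("facts:", len(facts), "longest sequence:", max(len(f) for f in facts))
```

Under these assumptions the longest fact is 7 tokens, comfortably inside the maximum context length of 16.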

Component	Configuration
Vocabulary	14 tokens: <pad>, 0-9, *, =, \n
d_model	32
Layers	2
Attention Heads	4 (d_head = 8)
FFN Hidden Dim	64
Max Context Length	16
Total Parameters	18,304
Task	Memorize 100 multiplication facts (0×0 to 9×9)
Accuracy	100%
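
The 18,304 total can be reproduced from this table under one set of assumptions about where the weights and biases live: learned positional embeddings, bias-free Q/K/V/output projections, biased feed-forward layers, scale-and-shift LayerNorms (two per block plus a final one), and an untied, bias-free output head. The source does not state this breakdown; it is simply one layout that matches the published total.

```python
# Values from the configuration table.
VOCAB_SIZE, D_MODEL, N_LAYERS, D_FFN, MAX_CTX = 14, 32, 2, 64, 16


def count_parameters():
    """Parameter count per component, under the assumptions stated above."""
    token_embedding = VOCAB_SIZE * D_MODEL           # 14 x 32
    position_embedding = MAX_CTX * D_MODEL           # 16 x 32, assumed learned
    attention = 4 * D_MODEL * D_MODEL                # W_q, W_k, W_v, W_o, no bias
    ffn = (D_MODEL * D_FFN + D_FFN) + (D_FFN * D_MODEL + D_MODEL)  # with biases
    block_layer_norms = 2 * (2 * D_MODEL)            # two LayerNorms, scale + shift
    per_block = attention + ffn + block_layer_norms
    final_layer_norm = 2 * D_MODEL
    output_head = D_MODEL * VOCAB_SIZE               # assumed untied, no bias
    return {
        "token_embedding": token_embedding,
        "position_embedding": position_embedding,
        "blocks": N_LAYERS * per_block,
        "final_layer_norm": final_layer_norm,
        "output_head": output_head,
    }


breakdown = count_parameters()
total = sum(breakdown.values())
for name, count in breakdown.items():
    print(f"{name:20s} {count:6d}")
print(f"{'total':20s} {total:6d}")
```

The number of attention heads does not appear in the count: splitting a 32-dimensional projection into 4 heads of 8 changes how the matrix is used, not how many weights it holds.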

Forward Pass Computation Panorama