🧠 Live Neural Network Monitor
Displays the shape, mean, standard deviation, and average absolute gradient of every parameter layer in real time.
| Parameter Layer | Shape | Params | Mean | Std | Gradient (Avg Abs) |
|---|---|---|---|---|---|
| Total Parameters: | 0 | | | | |
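The columns in the monitor table can be reproduced with a few lines of arithmetic. The sketch below is a rough illustration, not the page's actual code: `layer_stats` is a hypothetical helper, the tensors are plain nested lists, and the standard deviation is the population version (dividing by n). The page does not say which variant the monitor uses.

```python
import math

def layer_stats(name, weights, grads):
    """Summarize one parameter tensor given as nested lists of floats."""
    def flatten(x):
        # Recursively flatten nested lists into a flat list of floats.
        if isinstance(x, (int, float)):
            return [float(x)]
        return [v for item in x for v in flatten(item)]

    def shape(x):
        # Infer the shape from the nesting (assumes a rectangular layout).
        dims = []
        while isinstance(x, list):
            dims.append(len(x))
            x = x[0]
        return tuple(dims)

    w = flatten(weights)
    g = flatten(grads)
    n = len(w)
    mean = sum(w) / n
    # Population standard deviation (an assumption; a sample std divides by n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in w) / n)
    # "Gradient (Avg Abs)": mean of the absolute gradient values.
    grad_avg_abs = sum(abs(v) for v in g) / len(g)
    return {"layer": name, "shape": shape(weights), "params": n,
            "mean": mean, "std": std, "grad_avg_abs": grad_avg_abs}

# Toy 2x2 "layer" with made-up values, just to exercise the helper.
row = layer_stats("toy.weight", [[1.0, 2.0], [3.0, 4.0]], [[0.5, -0.5], [1.0, -1.0]])
print(row)
```

Summing the `params` field over all layers gives the "Total Parameters" row.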
🎯 Free Inference Lab
Edit the input below to see the model's next-token prediction probabilities in real time.
1st Digit Prediction
2nd Digit Prediction (based on 1st)
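The two prediction panels correspond to two consecutive next-token steps: the first digit is predicted from the prompt, and the second from the prompt plus the first digit. Here is a minimal sketch of that chain. It assumes the token order listed in the vocabulary, and `fake_logits` is a made-up stand-in for the real model, so the numbers it produces are illustrative only.

```python
import math

# Assumed token order: <pad>, digits 0-9, '*', '=', newline (14 tokens).
VOCAB = ["<pad>"] + [str(d) for d in range(10)] + ["*", "=", "\n"]

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits):
    # Return the most likely token and its probability.
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return VOCAB[best], probs[best]

def fake_logits(prompt):
    # Hypothetical stand-in for the trained model: it simply favors
    # the correct next digit of 7*8=56 with a made-up logit of 4.0.
    answers = {"7*8=": "5", "7*8=5": "6"}
    logits = [0.0] * len(VOCAB)
    logits[VOCAB.index(answers[prompt])] = 4.0
    return logits

prompt = "7*8="
first, p1 = top_prediction(fake_logits(prompt))           # 1st digit
second, p2 = top_prediction(fake_logits(prompt + first))  # 2nd digit, conditioned on the 1st
print(first, second)
```

Feeding the first predicted digit back into the prompt is what "based on 1st" refers to.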
Project Overview
In 2025, Large Language Models like GPT-4, Claude, Gemini, and DeepSeek are reshaping every industry. They write code, compose poetry, pass bar exams, and even reason through complex math problems. Behind all of them lies a single, revolutionary architecture: the Transformer.
But here's the problem: for most people, a Transformer is a black box. You've seen the diagrams, read the formulas, maybe even watched a dozen YouTube tutorials. Yet when you try to answer the simplest question, "What exactly happens inside the model when it sees 7*8= and outputs 56?", you draw a blank. The matrices are too large, the layers too deep, the numbers too abstract.
This project fixes that.
We took the same architecture that powers GPT (a decoder-only Transformer with multi-head self-attention, layer normalization, feed-forward networks, causal masking, and a softmax output) and compressed it into a model with only 18,304 parameters. For comparison, GPT-4 is widely reported, though never officially confirmed, to have roughly 1.8 trillion. We gave it the simplest possible task: memorize the multiplication table (0×0 through 9×9).
Then we cracked it wide open. Every single matrix. Every attention weight. Every intermediate calculation. Nothing hidden, nothing omitted. You're about to witness, number by number, how a neural network turns 7*8= into 56.
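To make the "causal masking" ingredient concrete before the numbers arrive, here is a minimal single-head sketch in plain Python. It is an illustration under simplifying assumptions (one head, no learned projections, toy vectors), not the project's implementation. Each position only scores itself and earlier positions, which is equivalent to setting future scores to negative infinity before the softmax.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (one per position).
    Returns the output vectors and the T x T attention weights.
    """
    T, d = len(q), len(q[0])
    out, weights = [], []
    for i in range(T):
        # Position i may only attend to positions 0..i (no peeking ahead).
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        # Future positions get exactly zero weight.
        w = softmax(scores) + [0.0] * (T - i - 1)
        weights.append(w)
        # Weighted sum of the value vectors.
        out.append([sum(w[j] * v[j][c] for j in range(T))
                    for c in range(len(v[0]))])
    return out, weights

# Toy inputs: three positions, two dimensions, made-up values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_attention(x, x, x)
print(weights[0])  # the first position can only attend to itself
```

The real model runs four such heads per layer on 8-dimensional slices, with learned query, key, and value projections in front.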
| Component | Configuration |
|---|---|
| Vocabulary | 14 tokens: `<pad>`, `0`-`9`, `*`, `=`, `\n` |
| d_model | 32 |
| Layers | 2 |
| Attention Heads | 4 (d_head = 8) |
| FFN Hidden Dim | 64 |
| Max Context Length | 16 |
| Total Parameters | 18,304 |
| Task | Memorize 100 multiplication facts (0×0 to 9×9) |
| Accuracy | 100% |
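The configuration is small enough to sanity-check by hand. The sketch below builds the 14-token vocabulary and tallies parameters under one set of assumptions that happens to sum to 18,304: learned positional embeddings, bias-free attention projections, biased feed-forward layers, LayerNorm with scale and shift, a final LayerNorm, and an untied, bias-free output head. None of these choices is stated in the table, and other layouts reach the same total (for example, biases on the attention projections instead of the feed-forward layers, with no final LayerNorm), so treat this as one plausible accounting. The token order and the `7*8=56\n` sample format are likewise assumed.

```python
# Assumed token order: <pad>, digits 0-9, '*', '=', newline.
VOCAB = ["<pad>"] + [str(d) for d in range(10)] + ["*", "=", "\n"]
STOI = {tok: i for i, tok in enumerate(VOCAB)}

def encode(text):
    # Character-level encoding: one token per character.
    return [STOI[ch] for ch in text]

# Values taken from the configuration table.
d_model, n_layers, d_ff, max_len = 32, 2, 64, 16
vocab_size = len(VOCAB)  # 14

def count_params():
    tok_emb = vocab_size * d_model                  # 14 * 32 = 448
    pos_emb = max_len * d_model                     # 16 * 32 = 512 (assumed learned)
    attn = 4 * d_model * d_model                    # Q, K, V, output; assumed bias-free: 4096
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # with biases: 4192
    layer_norms = 2 * (2 * d_model)                 # two LayerNorms per layer: 128
    per_layer = attn + ffn + layer_norms            # 8416
    final_ln = 2 * d_model                          # assumed final LayerNorm: 64
    lm_head = d_model * vocab_size                  # assumed untied, bias-free: 448
    return tok_emb + pos_emb + n_layers * per_layer + final_ln + lm_head

print(vocab_size)           # 14
print(encode("7*8=56\n"))   # token IDs under the assumed ordering
print(count_params())       # 18304 under the assumptions above
```

Note that the head count does not appear in the tally: splitting d_model = 32 into 4 heads of 8 dimensions reshapes the same projection matrices without adding parameters.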