A Legend of Zelda-inspired N64 homebrew ROM featuring Sophia Elya — an AI NPC
powered by a nano-GPT transformer running live inference on the MIPS R4300i CPU.
No precomputed responses. No lookup tables. Real matrix multiply, real softmax, real
attention — on a 93.75 MHz CPU from 1996.
Video demos:
Download ROM:
`legend_of_elya.z64` — ready to run in the ares emulator or on an EverDrive 64
- Press A near Sophia Elya to trigger AI dialog
- The N64 CPU runs a full 4-layer transformer: embedding → attention → FFN → logits → sampling
- Output tokens appear character-by-character with a live tok/s counter
- Each response is different — seeded by CPU oscillator jitter (hardware entropy)
- 32 prompts covering identity, Zelda lore, RustChain, hardware trivia
- Runs in the ares emulator and on real N64 hardware via EverDrive 64
| Parameter | Value |
|---|---|
| Parameters | 819,200 (819K) |
| Layers | 4 |
| Embedding dim | 128 |
| Attention heads | 4 (32-dim each) |
| Vocabulary | 256 (byte-level ASCII) |
| Context window | 64 tokens |
| Quantization | Q8 (int8 weights + float16 block scales, 32-weight blocks) |
| Weight file | 458 KB on cartridge ROM |
| Inference math | Float32 on MIPS R4300i FPU |
| Speed | ~60 tok/s in emulator, ~1-3 tok/s on real hardware |
| KV cache | 256 KB in RDRAM |
| Total RDRAM | ~263 KB (KV cache + 7KB scratch) |
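The 256 KB KV-cache figure follows directly from the config above: one key vector and one value vector per layer, per context position, stored as float32. A quick sanity check (constants taken from the table):

```c
#include <stddef.h>

/* KV-cache footprint for the config above: K and V vectors per layer,
 * per context position, stored as float32 in RDRAM. */
enum { N_LAYERS = 4, CTX_LEN = 64, N_EMBED = 128 };

static size_t kv_cache_bytes(void)
{
    /* 2 tensors (K and V) x layers x positions x embedding dim x 4 bytes */
    return (size_t)2 * N_LAYERS * CTX_LEN * N_EMBED * sizeof(float);
}
/* 2 * 4 * 64 * 128 * 4 = 262,144 bytes = 256 KB, matching the table */
```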
- Float32 inference — all activations, attention scores, and accumulations are IEEE 754 float32
- On-the-fly Q8 dequantization — weights stay compressed as int8 in ROM; dequantized per matmul
- Custom Taylor `exp()` — range reduction `exp(x) = exp(x/128)^128` with a degree-4 Taylor series and 7 squarings. Uses zero float-to-int casts to avoid the R4300i's missing `trunc.w.s` instruction
- Quake III fast inverse sqrt — the `0x5f3759df` bit trick with 2 Newton-Raphson iterations for RMS normalization
- Big-endian aware — the weight file is little-endian (Python export) while the N64 is big-endian; `swap16`/`swap32` helpers handle byte-order conversion for header fields and float16 scales
- Hardware entropy — the MIPS CP0 Count register XOR'd with the frame counter for RNG seeding
- Greedy sampling — pure argmax over printable ASCII (32-126), matching the output quality of the proven x86 reference implementation
- Embedding scale restoration — Q8 export normalizes to [-1,1]; the original scale factor (em=3.5) is stored in a header byte and restored at init
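The custom `exp()` can be sketched as follows. This is a minimal reconstruction from the description above, not the repo's exact code; the constants and polynomial degree in `nano_gpt.c` may differ:

```c
/* Range-reduced exponential: exp(x) = exp(x/128)^128. Dividing by 128
 * shrinks the argument so a degree-4 Taylor series is accurate, then
 * squaring 7 times restores the result (2^7 = 128). There are no
 * float-to-int casts anywhere, so trunc.w.s is never needed. */
static float fast_exp(float x)
{
    float y = x * (1.0f / 128.0f);  /* range reduction */
    /* degree-4 Taylor of exp(y): 1 + y + y^2/2 + y^3/6 + y^4/24 (Horner) */
    float r = 1.0f + y * (1.0f + y * (0.5f + y * (1.0f / 6.0f + y * (1.0f / 24.0f))));
    for (int i = 0; i < 7; i++)     /* raise to the 128th power */
        r *= r;
    return r;
}
```

For softmax inputs (scores shifted so the maximum is 0, arguments ≤ 0), the relative error of this scheme stays around 1e-5 — far below what attention weights can notice.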
| File | Purpose |
|---|---|
| `nano_gpt.c` | Float32 GPT inference engine (MIPS R4300i) |
| `nano_gpt.h` | Model struct definitions, KV cache, API |
| `legend_of_elya.c` | Game: dungeon scene, sprites, dialog, music, HUD |
| `train_sophia_v5.py` | PyTorch training + Q8 weight export |
| `weights/sophia_weights.bin` | Pre-trained v5 weights (458 KB, ready to use) |
| `Makefile` | libdragon build system |
| `src/` | Latest source snapshots |
| `screenshots/` | Working N64 LLM screenshots |
Download `legend_of_elya.z64` from Releases and load it in the ares emulator, or copy it to an EverDrive SD card.
Requires the libdragon toolchain:

```sh
# Set toolchain path
export N64_INST=/path/to/mips64-toolchain

# Place weights in filesystem/
cp weights/sophia_weights.bin filesystem/

# Build
make clean && make

# Run in ares
ares legend_of_elya.z64
```

To retrain from scratch:

```sh
# Requires PyTorch + CUDA GPU
python3 train_sophia_v5.py
# ~20 min on RTX 5070, exports filesystem/sophia_weights.bin
```

The `weights/sophia_weights.bin` file contains a pre-trained v5 model (819K params, Q8 format, 458 KB).
Training corpus covers: Sophia Elya identity, RustChain blockchain, Zelda lore, N64 hardware, PowerPC architecture, dungeon/RPG dialog.
Weight file format:
| Offset | Size | Field |
|---|---|---|
| 0 | 4 | Magic: 0x53454149 ("SEAI"), little-endian |
| 4 | 1 | n_layers (4) |
| 5 | 2 | n_embed (128) |
| 7 | 1 | n_heads (4) |
| 8 | 2 | vocab_size (256) |
| 10 | 1 | ctx_len (64) |
| 11 | 1 | em_scale_x16 (56 = 3.5 × 16) |
| 12 | 32768 | Embedding table (256 × 128, int8) |
| 32780 | ... | Layer weights (int8) + scales (float16) × 4 layers |
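A header parser for this layout might look like the sketch below. Field offsets come from the table; reading the little-endian fields byte-by-byte sidesteps host endianness entirely, which is one alternative to the `swap16`/`swap32` helpers the repo uses (different mechanism, same result):

```c
#include <stdint.h>

/* Header fields per the offset table above. */
typedef struct {
    uint32_t magic;      /* 0x53454149 ("SEAI") */
    uint8_t  n_layers;
    uint16_t n_embed;
    uint8_t  n_heads;
    uint16_t vocab_size;
    uint8_t  ctx_len;
    float    em_scale;   /* em_scale_x16 / 16.0 */
} ModelHeader;

/* Assemble little-endian integers byte-by-byte: correct on both the
 * big-endian N64 and a little-endian development host. */
static uint32_t read_u32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint16_t read_u16le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static int parse_header(const uint8_t *p, ModelHeader *h)
{
    h->magic = read_u32le(p + 0);
    if (h->magic != 0x53454149u)     /* "SEAI" */
        return -1;
    h->n_layers   = p[4];
    h->n_embed    = read_u16le(p + 5);
    h->n_heads    = p[7];
    h->vocab_size = read_u16le(p + 8);
    h->ctx_len    = p[10];
    h->em_scale   = p[11] / 16.0f;   /* 56 -> 3.5 */
    return 0;
}
```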
- 819K parameters. Responses are short and sometimes imprecise ("rinces" instead of "Princess"). Expected at this scale. The achievement is real-time transformer inference on 1996 hardware.
- Context window is 64 tokens. Prompt + response must fit in 64 bytes.
- No memory between dialogs. KV cache resets each conversation.
- Byte-level vocabulary. One ASCII character per token — no subword tokenization.
- Training corpus is small. More data and epochs will improve coherence.
The goal is to shrink, optimize, and package this into a reusable SDK that any N64 homebrew developer can drop into their game to give NPCs real language understanding.
| Config | Layers | Embed | Params | Weight Size | RAM (KV+scratch) | Use Case |
|---|---|---|---|---|---|---|
| Tiny | 2 | 64 | ~100K | ~60KB | ~70KB | Simple responses, many NPCs |
| Small | 4 | 128 | 819K | 458KB | 263KB | Current — single NPC dialog |
| Medium | 6 | 192 | ~2.8M | ~1.5MB | 600KB | Rich dialog, Expansion Pak |
| Large | 8 | 256 | ~8.4M | ~4.2MB | 1.6MB | Full conversations, 8MB mode |
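The parameter counts in this table can be reproduced with a standard GPT layout. The assumptions below (tied embedding/output weights, no bias terms, 4× FFN expansion) are ours, not stated by the repo — but they reproduce the Small config's 819,200 figure exactly:

```c
/* Parameter count for a GPT with tied embedding/output weights, no
 * biases, and hidden = 4 * n_embed in the FFN. These are assumptions,
 * not confirmed by the repo; they match 819,200 exactly for Small. */
static long gpt_params(long layers, long embed, long vocab)
{
    long embedding = vocab * embed;           /* also reused as the output head */
    long attn      = 4 * embed * embed;       /* Wq, Wk, Wv, Wo */
    long ffn       = 2 * embed * (4 * embed); /* up + down projections */
    return embedding + layers * (attn + ffn);
}
/* gpt_params(4, 128, 256) == 819,200 */
```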
Every "AI NPC" in modern games is a cloud API call. This runs entirely on the cartridge — no internet, no server, no loading screen. The VR4300 does the matrix math. The ROM holds the weights. The RDRAM holds the KV cache.
It's the same transformer architecture as GPT — just 819K parameters instead of 175 billion. And it runs on hardware that predates Google.
If we can make a transformer talk on 8MB of RAM and a 93MHz MIPS CPU, the excuses for cloud-dependent "AI" in games evaporate.
| IBM POWER8 Response | Zelda Triforce Response |
|---|---|
| ![]() | ![]() |
Built by Elyan Labs.
- Engine: nano-GPT float32 inference on MIPS R4300i
- Game: libdragon SDK, pixel art, LOZ-inspired dungeon
- Training: PyTorch on RTX 5070
- Platform: BoTTube for video hosting
Source is open — build it, train it, improve it, port it.

