Show HN: AxonML – A PyTorch-equivalent ML framework written in Rust

2026-02-28 23:17 · 43 · github.com



A complete, PyTorch-equivalent machine learning framework written in pure Rust.


Axonml (named after axons - the nerve fibers that transmit signals between neurons) is an ambitious open-source project to create a complete machine learning framework in Rust. Our goal is to provide the same comprehensive functionality as PyTorch while leveraging Rust's performance, safety, and concurrency guarantees.

AxonML provides comprehensive PyTorch-equivalent functionality with 1,988 passing tests. Several features go beyond PyTorch with novel capabilities not available in any other framework.

  • Tensor Operations (axonml-tensor)

    • N-dimensional tensors with arbitrary shapes
    • Automatic broadcasting following NumPy rules
    • Efficient views and slicing (zero-copy where possible)
    • Arithmetic operations (+, -, *, /, matmul)
    • Reduction operations (sum, mean, max, min, prod)
    • Sorting operations (sort, argsort, topk)
    • Indexing operations (gather, scatter, nonzero, unique)
    • Shape operations (flip, roll, squeeze, unsqueeze, permute)
    • Activation functions (ReLU, Sigmoid, Tanh, Softmax, GELU, SiLU, ELU, LeakyReLU)
    • Sparse tensor support (COO format)
    • Lazy Tensor Computation (novel) - Deferred execution with algebraic optimization (constant folding, identity elimination, inverse cancellation, scalar folding) — built into the tensor type, no external JIT needed
  • Automatic Differentiation (axonml-autograd)

    • Dynamic computational graph
    • Reverse-mode autodiff (backpropagation)
    • Gradient functions for all operations
    • no_grad context manager
    • Automatic Mixed Precision (AMP) - autocast context for F16 training
    • Gradient Checkpointing - trade compute for memory
    • Graph Inspection API (novel) - Native computation graph visualization and analysis (trace_backward, DOT export, node/depth/leaf counting, gradient flow summary) — no external tools needed (unlike PyTorch's torchviz)
  • Neural Networks (axonml-nn)

    • Module trait with train/eval modes
    • Linear, Conv1d/2d, MaxPool, AvgPool, AdaptiveAvgPool
    • BatchNorm1d/2d, LayerNorm, GroupNorm, InstanceNorm2d
    • Dropout
    • RNN, LSTM, GRU (with cell variants)
    • MultiHeadAttention, Embedding
    • Loss functions (MSE, CrossEntropy, BCE, BCEWithLogits, L1, SmoothL1, NLL)
    • Parameter initialization (Xavier, Kaiming, Orthogonal, etc.)
    • Differentiable Structured Sparsity (novel) - SparseLinear with learnable pruning masks via soft thresholding, GroupSparsity regularization, and LotteryTicket hypothesis implementation — the pruning mask is differentiable, enabling end-to-end learning of which weights to prune
  • Optimizers (axonml-optim)

    • SGD with momentum and Nesterov
    • Adam, AdamW, RMSprop
    • LAMB - Layer-wise Adaptive Moments for large batch training
    • GradScaler - Gradient scaling for mixed precision
    • LR Schedulers (Step, Cosine, OneCycle, Warmup, ReduceLROnPlateau, MultiStep, Exponential)
    • Training Health Monitor (novel) - Real-time training diagnostics: NaN/gradient explosion/vanishing detection, loss trend analysis (decreasing/stable/increasing/oscillating), dead neuron tracking, convergence detection, automatic learning rate suggestions — the optimizer monitors its own health
  • Data Loading (axonml-data)

    • Dataset trait and DataLoader
    • Batching and shuffling
    • Sequential and random samplers
  • Computer Vision (axonml-vision)

    • Image transforms (Resize, Crop, Flip, Normalize)
    • SyntheticMNIST, SyntheticCIFAR datasets
    • LeNet, SimpleCNN, ResNet, VGG, ViT architectures
    • Aegis Identity (novel) — Unified biometric framework (~362K params, <2MB) with 5 novel architectures:
      • Mnemosyne - Face identity via temporal crystallization (GRU attractor convergence, liveness detection)
      • Ariadne - Fingerprint via ridge event fields (Gabor wavelet banks, singularity detection)
      • Echo - Voice via predictive speaker residuals (identity = what can't be predicted)
      • Argus - Iris via polar-native radial phase encoding (rotation-invariant matching)
      • Themis - Multimodal belief propagation fusion (uncertainty-aware, not score averaging)
      • Forensic verification, batch ops, drift detection, quality gating, operating curves
      • Each modality deployable independently on Raspberry Pi
    • Object Detection Training Infrastructure (novel)
      • Image I/O: load_image, load_image_resized, rgb_bytes_to_tensor (CHW, [0,1] normalized)
      • Dataset loaders: CocoDataset (COCO JSON, category remapping), WiderFaceDataset (WIDER FACE annotations)
      • Detection losses: FocalLoss, GIoULoss, UncertaintyLoss, compute_centerness
      • FCOS target assignment (multi-scale) and Phantom target assignment (single-scale)
      • Training loops: nexus_training_step(), phantom_training_step() (full forward→loss→backward→step)
      • Evaluation: compute_ap, compute_map, compute_coco_map (AP/mAP at IoU thresholds)
    • Nexus (novel) — Dual-pathway object detector (~430K params) with predictive coding, persistent GRU object memory, uncertainty quantification, and 3-scale anchor-free heads
    • Phantom (novel) — Event-driven face detector (~126K params) with sparse processing, GRU face tracking, and confidence accumulation. Compute drops to ~5% in steady state
  • Audio Processing (axonml-audio)

    • MelSpectrogram, MFCC transforms
    • Resample, Normalize, AddNoise
    • SyntheticCommandDataset, SyntheticMusicDataset
  • NLP Utilities (axonml-text)

    • Tokenizers (Whitespace, Char, BPE)
    • Vocabulary management
    • SyntheticSentimentDataset
  • Distributed Training (axonml-distributed)

    • DistributedDataParallel (DDP) - Data parallelism across GPUs
    • Fully Sharded Data Parallel (FSDP) - ZeRO-2/ZeRO-3 memory optimization
    • Pipeline Parallelism - Model sharding across devices with microbatching
    • Tensor Parallelism - Layer-wise model parallelism
    • All-reduce, broadcast, barrier, send/recv collective operations
    • Process group management with multiple backends
  • Model Serialization (axonml-serialize)

    • Save/load models in multiple formats
    • Checkpoint management for training
    • StateDict (PyTorch-compatible concept)
    • SafeTensors format support
  • ONNX Import/Export (axonml-onnx)

    • Load ONNX models for inference
    • Export Axonml models to ONNX format
    • 40+ ONNX operators supported
    • ONNX opset version 17
  • Model Quantization (axonml-quant)

    • INT8 (Q8_0), INT4 (Q4_0, Q4_1), INT5 (Q5_0, Q5_1) formats
    • Half-precision (F16) support
    • Block-based quantization with calibration
    • ~8x model size reduction with Q4
  • Kernel Fusion (axonml-fusion)

    • Automatic fusion pattern detection
    • FusedLinear (MatMul + Bias + Activation)
    • FusedElementwise operation chains
    • Up to 2x speedup for memory-bound operations
  • Command Line Interface (axonml-cli)

    • Complete CLI for ML workflows
    • Real training with axonml components
    • Weights & Biases integration for experiment tracking
    • Model conversion and export
  • Terminal User Interface (axonml-tui)

    • Interactive terminal-based dashboard
    • Model architecture visualization
    • Real-time training progress monitoring
    • Dataset statistics and graphs
    • File browser for models and datasets
  • Web Dashboard (axonml-dashboard)

    • Modern Leptos/WASM web frontend
    • Real-time training monitoring with WebSocket
    • Model registry and version management
    • Inference endpoint deployment
    • Multi-factor authentication (TOTP, WebAuthn)
  • API Server (axonml-server)

    • Axum-based REST API backend
    • JWT authentication with refresh tokens
    • Training run management
    • Model registry and deployment
    • WebSocket terminal (PTY) for in-browser shell access
    • Prometheus metrics export
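The broadcasting listed under tensor operations follows NumPy's rules: shapes are aligned from the trailing dimension, and two dimensions are compatible when they are equal or when one of them is 1. A plain-Rust sketch of just the shape-inference step (illustrative only, not AxonML's internals):

```rust
/// Compute the broadcast shape of two tensors per NumPy rules:
/// align trailing dimensions; dims are compatible if equal or if one is 1.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = vec![0; n];
    for i in 0..n {
        // Missing leading dimensions are treated as size 1.
        let da = if i < n - a.len() { 1 } else { a[i - (n - a.len())] };
        let db = if i < n - b.len() { 1 } else { b[i - (n - b.len())] };
        out[i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible shapes
        };
    }
    Some(out)
}

fn main() {
    assert_eq!(broadcast_shape(&[2, 3], &[3]), Some(vec![2, 3]));
    assert_eq!(broadcast_shape(&[8, 1, 6], &[7, 1]), Some(vec![8, 7, 6]));
    assert_eq!(broadcast_shape(&[2, 3], &[4]), None);
    println!("ok");
}
```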

The Axonml CLI provides a unified command-line interface for the entire ML workflow:

# Server Sync (CLI ↔ Webapp Integration)
axonml login # Login to AxonML server
axonml login --server http://server:3021 # Login to custom server
axonml logout # Logout and clear credentials
axonml sync # Check sync status with server
axonml sync --full # Full sync of training runs, models, datasets

# Project Management
axonml new my-model # Scaffold new project
axonml init # Initialize in existing directory
axonml scaffold my-project # Generate Rust training project

# Training (with real axonml integration)
axonml train config.toml # Train from config file
axonml train --model mlp --epochs 10 # Quick training
axonml resume checkpoint.axonml # Resume from checkpoint

# Evaluation & Inference
axonml eval model.axonml --data test/ # Evaluate model metrics
axonml predict model.axonml input.json # Run inference

# Model Management
axonml convert pytorch.pth # Convert PyTorch models
axonml export model.axonml --onnx # Export to ONNX
axonml inspect model.axonml # Inspect architecture
axonml rename model.axonml new-name # Rename model files

# Quantization
axonml quant convert model.axonml --type q8_0 # Quantize to Q8
axonml quant convert model.pth --type q4_0 # PyTorch → Quantized Axonml
axonml quant info model.axonml # Show quantization info
axonml quant benchmark model.axonml # Benchmark quantized model
axonml quant list # List supported formats

# Workspace Management
axonml load model model.axonml # Load model into workspace
axonml load data ./dataset # Load dataset into workspace
axonml load both --model m.f --data d/ # Load both
axonml load status # Show workspace status
axonml load clear # Clear workspace

# Analysis & Reports
axonml analyze model # Analyze loaded model
axonml analyze data # Analyze loaded dataset
axonml analyze both # Analyze both
axonml analyze report --format html # Generate analysis report

# Data Management
axonml data info ./dataset # Dataset information
axonml data validate ./dataset # Validate dataset format
axonml data split ./data --train 0.8 # Split dataset

# Bundling & Deployment
axonml zip create -o bundle.zip --model m.f --data d/ # Create bundle
axonml zip extract bundle.zip -o ./output # Extract bundle
axonml zip list bundle.zip # List bundle contents
axonml upload model.axonml --hub myrepo # Upload to model hub
axonml serve model.axonml --port 8080 # Start inference server

# Benchmarking
axonml bench model model.axonml # Benchmark model performance
axonml bench inference model.axonml # Test batch size scaling
axonml bench compare model1.f,model2.f # Compare multiple models
axonml bench hardware # CPU/memory benchmarks

# GPU Management
axonml gpu list # List available GPUs
axonml gpu info # Detailed GPU information
axonml gpu select 0 # Select GPU for training
axonml gpu bench # GPU compute benchmarks
axonml gpu memory # Show GPU memory usage
axonml gpu status # Current GPU status

# Pretrained Model Hub
axonml hub list # List available pretrained models
axonml hub info resnet50 # Show model details
axonml hub download resnet50 # Download pretrained weights
axonml hub cached # Show cached models
axonml hub clear # Clear all cached weights

# Kaggle Integration
axonml kaggle login <username> <key> # Save Kaggle API credentials
axonml kaggle status # Check authentication status
axonml kaggle search "image classification" # Search datasets
axonml kaggle download owner/dataset # Download dataset
axonml kaggle list # List downloaded datasets

# Dataset Management (NexusConnectBridge)
axonml dataset list # List available datasets
axonml dataset list --source kaggle # List from specific source
axonml dataset info mnist # Show dataset details
axonml dataset search "classification" # Search datasets
axonml dataset download cifar-10 # Download dataset
axonml dataset sources # List data sources

# Dashboard & Server Management
axon start # Start dashboard + API server
axon start --server # Start only API server on :3000
axon start --dashboard # Start only dashboard on :8080
axon stop # Stop all services
axon status # Check service status
axon logs -f # Follow logs in real-time

Built-in experiment tracking with W&B:

# Configure W&B
axonml wandb login
axonml wandb init --project my-project

# Training automatically logs to W&B
axonml train config.toml --wandb

Features:

  • Automatic metric logging (loss, accuracy, learning rate)
  • Hyperparameter tracking
  • Model checkpointing with W&B artifacts
  • Real-time training visualization

The Axonml TUI provides an interactive terminal-based dashboard for ML development:

# Launch the TUI
axonml tui

# Load a model on startup
axonml tui --model path/to/model.axonml

# Load a dataset on startup
axonml tui --data path/to/dataset/

# Load both
axonml tui --model model.axonml --data ./data/

Views:

  • Model - Neural network architecture visualization (layers, shapes, parameters)
  • Data - Dataset statistics, class distributions, sample preview
  • Training - Real-time epoch/batch progress, loss/accuracy metrics
  • Graphs - Loss curves, accuracy curves, learning rate schedule
  • Files - File browser for models and datasets
  • Help - Keyboard shortcuts reference

Keyboard Navigation:

Key               Action
Tab / Shift+Tab   Switch between tabs
1-5               Jump directly to tab
↑/k, ↓/j          Navigate up/down in lists
←/h, →/l          Navigate between panels
Enter             Select / Open
?                 Show help overlay
q                 Quit

The AxonML Web Dashboard provides a modern browser-based interface for ML operations:

# Start the full stack (dashboard + API server)
axon start

# Start only the API server
axon start --server --port 3000

# Start only the dashboard
axon start --dashboard --dashboard-port 8080

# Check status
axon status

# View logs
axon logs -f

Features:

  • Dashboard Overview - Real-time stats on training runs, models, and endpoints
  • Training Runs - Start, monitor, and manage training with live metrics
  • Model Registry - Upload, version, and manage trained models
  • Inference Endpoints - Deploy models for serving predictions
  • In-App Terminal - Slide-out terminal with WebSocket PTY for server-side commands
  • Settings - User profile, security settings, MFA configuration

Authentication:

  • JWT-based authentication with refresh tokens
  • Multi-factor authentication (TOTP authenticator apps)
  • WebAuthn support for hardware security keys
  • Recovery codes for account recovery

Architecture:

┌─────────────────────────────────────────────────────────────┐
│                    axonml-dashboard                          │
│              Leptos/WASM Frontend (CSR)                      │
├─────────────────────────────────────────────────────────────┤
│  Dashboard │ Training │ Models │ Inference │ Settings       │
└─────────────────────────────────────────────────────────────┘
                              │
                         HTTP/WebSocket
                              │
┌─────────────────────────────────────────────────────────────┐
│                      axonml-server                           │
│                    Axum REST + WS API                        │
├─────────────────────────────────────────────────────────────┤
│  Auth  │  Training  │  Models  │  Inference  │  Metrics     │
└─────────────────────────────────────────────────────────────┘
  • Pretrained Model Hub (axonml-vision/hub)

    • Download pretrained weights (ResNet, VGG)
    • Local caching in ~/.cache/axonml/hub/
    • StateDict for named tensor storage
    • CLI: axonml hub list/info/download/cached/clear
  • Kaggle Integration (axonml-cli)

    • Kaggle API authentication
    • Dataset search and download
    • CLI: axonml kaggle login/status/search/download/list
  • Dataset Management (axonml-cli)

    • NexusConnectBridge API integration
    • Built-in datasets (MNIST, CIFAR, Iris, Wine, etc.)
    • Multiple data sources (Kaggle, UCI, data.gov)
    • CLI: axonml dataset list/info/search/download/sources
  • JIT Compilation (axonml-jit)

    • Intermediate representation for computation graphs
    • Operation tracing and graph building
    • Graph optimization (constant folding, DCE, CSE)
    • Function caching for compiled graphs
    • Cranelift foundation for native codegen
  • Profiling Tools (axonml-profile)

    • Core Profiler with ProfileGuard and ProfileReport
    • MemoryProfiler for allocation tracking
    • ComputeProfiler for operation timing
    • TimelineProfiler with Chrome trace export
    • BottleneckAnalyzer for automatic issue detection
  • LLM Architectures (axonml-llm)

    • BERT encoder (BertConfig, Bert, BertLayer)
    • BertForSequenceClassification, BertForMaskedLM
    • GPT-2 decoder (GPT2Config, GPT2, GPT2Block)
    • GPT2LMHead for language modeling
    • Text generation with top-k, top-p, temperature sampling
    • Pretrained Model Hub - LLaMA, Mistral, Phi, Qwen model configs
  • GPU Backends (axonml-core)

    • CUDA - Full NVIDIA GPU support with cuBLAS, PTX kernels
    • Vulkan - Cross-platform GPU compute
    • Metal - Apple Silicon optimization
    • WebGPU - Browser-based GPU acceleration
    • GPU Test Suite - Comprehensive correctness testing with CPU reference
  • Model Hub & Benchmarking (axonml)

    • Unified Model Hub - Combined vision/LLM model registry
    • Model Benchmarking - Throughput testing, memory profiling
    • Pretrained Weights - ResNet, VGG, MobileNet, EfficientNet, BERT, GPT-2
  • Real-time model serving with batched inference
  • Self-hosted pretrained weight hosting

Add Axonml to your Cargo.toml:

[dependencies]
axonml = "0.4"
use axonml::prelude::*;

fn main() {
    // Create tensors
    let a = zeros::<f32>(&[2, 3]);
    let b = ones::<f32>(&[2, 3]);

    // Arithmetic operations with broadcasting
    let c = &a + &b;
    let d = &c * 2.0;

    // Matrix operations
    let e = randn::<f32>(&[3, 4]);
    let f = randn::<f32>(&[4, 5]);
    let g = e.matmul(&f).unwrap();

    // Reductions
    let sum = d.sum();
    let mean = d.mean().unwrap();

    // Activations
    let h = randn::<f32>(&[10]);
    let activated = h.relu();

    println!("Result shape: {:?}", g.shape());
}
use axonml::prelude::*;
use axonml_nn::{Sequential, Linear, ReLU, CrossEntropyLoss, Module};
use axonml_optim::{Adam, Optimizer};
use axonml_data::{DataLoader, Dataset};

fn main() {
    // Build model
    let model = Sequential::new()
        .add(Linear::new(784, 256))
        .add(ReLU)
        .add(Linear::new(256, 10));

    // Setup optimizer
    let mut optimizer = Adam::new(model.parameters(), 0.001);

    // Training loop (dataset/dataloader construction omitted)
    for epoch in 0..10 {
        for batch in dataloader.iter() {
            let output = model.forward(&batch.data);
            let loss = CrossEntropyLoss::new().compute(&output, &batch.targets);
            optimizer.zero_grad();
            loss.backward();
            optimizer.step();
        }
    }
}
use axonml::prelude::*;

// Zeros and ones
let z = zeros::<f32>(&[2, 3, 4]);
let o = ones::<f64>(&[5, 5]);

// Random tensors
let r = rand::<f32>(&[10, 10]);          // Uniform [0, 1)
let n = randn::<f32>(&[10, 10]);         // Normal(0, 1)
let u = uniform::<f32>(&[5], -1.0, 1.0);

// Ranges
let a = arange::<f32>(0.0, 10.0, 1.0);
let l = linspace::<f32>(0.0, 1.0, 100);

// From data
let t = Tensor::<f32>::from_vec(vec![1.0, 2.0, 3.0], &[3]).unwrap();

// Special matrices
let eye = eye::<f32>(4);
let diag = diag(&[1.0, 2.0, 3.0]);
use axonml::prelude::*;

let t = randn::<f32>(&[2, 3, 4]);

// Reshape
let r = t.reshape(&[6, 4]).unwrap();
let f = t.flatten();

// Transpose
let p = t.permute(&[2, 0, 1]).unwrap();

// Squeeze/Unsqueeze
let s = t.unsqueeze(0).unwrap();       // Add dimension
let u = s.squeeze(Some(0)).unwrap();   // Remove dimension

// Views
let v = t.slice_dim0(0, 1).unwrap();
let n = t.narrow(1, 0, 2).unwrap();
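The views above are zero-copy because operations like permute and narrow only rewrite shape/stride/offset metadata over a shared buffer. A minimal plain-Rust sketch of that mechanism (illustrative only, not AxonML's actual Tensor internals):

```rust
use std::sync::Arc;

/// A zero-copy view: shape/strides/offset metadata over a shared buffer.
/// Illustrative sketch only -- not AxonML's actual Tensor type.
struct View {
    data: Arc<Vec<f32>>,
    shape: Vec<usize>,
    strides: Vec<usize>,
    offset: usize,
}

impl View {
    fn new(data: Vec<f32>, shape: Vec<usize>) -> Self {
        // Row-major (C-contiguous) strides.
        let mut strides = vec![1; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1];
        }
        View { data: Arc::new(data), shape, strides, offset: 0 }
    }

    fn get(&self, idx: &[usize]) -> f32 {
        let flat: usize = idx.iter().zip(&self.strides).map(|(i, s)| i * s).sum();
        self.data[self.offset + flat]
    }

    /// Reorder dimensions without touching the data.
    fn permute(&self, dims: &[usize]) -> View {
        View {
            data: Arc::clone(&self.data),
            shape: dims.iter().map(|&d| self.shape[d]).collect(),
            strides: dims.iter().map(|&d| self.strides[d]).collect(),
            offset: self.offset,
        }
    }

    /// Restrict one dimension to [start, start+len) by bumping the offset.
    fn narrow(&self, dim: usize, start: usize, len: usize) -> View {
        let mut shape = self.shape.clone();
        shape[dim] = len;
        View {
            data: Arc::clone(&self.data),
            shape,
            strides: self.strides.clone(),
            offset: self.offset + start * self.strides[dim],
        }
    }
}

fn main() {
    // 2x3 tensor: [[0,1,2],[3,4,5]]
    let t = View::new((0..6).map(|x| x as f32).collect(), vec![2, 3]);
    let p = t.permute(&[1, 0]);       // transpose: shape [3, 2]
    assert_eq!(p.get(&[2, 1]), 5.0);  // p[2][1] == t[1][2]
    let n = t.narrow(1, 1, 2);        // columns 1..3
    assert_eq!(n.get(&[0, 0]), 1.0);
    println!("ok");
}
```

Both `permute` and `narrow` clone an `Arc`, so no element of the buffer is ever copied.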

AxonML powers real-time predictive maintenance on HVAC systems across commercial buildings. 12 models (6 LSTM autoencoders for anomaly detection + 6 GRU failure predictors) run live inference on Raspberry Pi edge controllers, processing sensor data at 1 Hz.

Building     Unit       Anomaly Detector    Failure Predictor   Params   RSS
FCOG         Mechroom   Erebus (LSTM-AE)    Kairos (GRU-FDD)    416K     2.5 MB
Warren       AHU-1      Aether              Moros               105K     2.1 MB
Warren       AHU-2      Phanes              Hecate              233K     2.4 MB
Warren       AHU-4      Nyctos              Cassandra           105K     2.1 MB
Warren       AHU-7      Poseidon            Triton              105K     2.1 MB
Huntington   Mechroom   Plutus              Moira               415K     3.2 MB

Stack: AxonML training (CPU) → .axonml model files → cross-compiled ARM inference daemons (armv7-unknown-linux-musleabihf) → PM2-managed services on Raspberry Pi → REST API (/api/inference/latest)

Each daemon runs pure-tensor inference (no autograd overhead), polls local NexusEdge for sensor data, maintains rolling time-series buffers, and exposes anomaly scores + failure predictions via HTTP.
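As a rough illustration of the rolling-buffer half of such a daemon, here is a hypothetical plain-Rust sketch; the z-score stands in for the LSTM autoencoder's reconstruction error, which is the signal the real models produce:

```rust
use std::collections::VecDeque;

/// Fixed-capacity rolling buffer of sensor readings, as an edge daemon
/// might maintain. The z-score here is a stand-in for the LSTM
/// autoencoder's reconstruction error -- purely illustrative.
struct RollingBuffer {
    window: VecDeque<f32>,
    capacity: usize,
}

impl RollingBuffer {
    fn new(capacity: usize) -> Self {
        RollingBuffer { window: VecDeque::with_capacity(capacity), capacity }
    }

    fn push(&mut self, sample: f32) {
        if self.window.len() == self.capacity {
            self.window.pop_front(); // drop the oldest reading
        }
        self.window.push_back(sample);
    }

    /// |x - mean| / std of the current window (0.0 if degenerate).
    fn anomaly_score(&self, x: f32) -> f32 {
        let n = self.window.len() as f32;
        if n < 2.0 { return 0.0; }
        let mean = self.window.iter().sum::<f32>() / n;
        let var = self.window.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
        if var == 0.0 { 0.0 } else { (x - mean).abs() / var.sqrt() }
    }
}

fn main() {
    let mut buf = RollingBuffer::new(120); // e.g. 2 minutes at 1 Hz
    for _ in 0..120 { buf.push(20.0); }    // steady-state temperature
    buf.push(20.5);
    let normal = buf.anomaly_score(20.4);
    let spike = buf.anomaly_score(35.0);
    assert!(spike > normal);
    println!("normal={normal:.2} spike={spike:.2}");
}
```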

+--------------------------------------------------------------------+
|                        axonml (main crate)                         |
+--------------------------------------------------------------------+
| axonml-vision  |  axonml-audio  |  axonml-text | axonml-distributed|
+----------------+----------------+--------------+-------------------+
|   axonml-llm   |   axonml-jit   |         axonml-profile           |
+----------------+----------------+----------------------------------+
|        axonml-serialize         |           axonml-onnx            |
+---------------------------------+----------------------------------+
|          axonml-quant           |          axonml-fusion           |
+---------------------------------+----------------------------------+
|                            axonml-data                             |
+---------------------------------+----------------------------------+
|          axonml-optim           |            axonml-nn             |
+---------------------------------+----------------------------------+
|                          axonml-autograd                           |
+--------------------------------------------------------------------+
|                          axonml-tensor                             |
+--------------------------------------------------------------------+
|                           axonml-core                              |
+------------+------------+------------+------------+----------------+
|  CPU/BLAS  |    CUDA    |   Vulkan   |   Metal    |     WebGPU     |
+------------+------------+------------+------------+----------------+

+--------------------------------------------------------------------+
|                           axonml-cli                              |
|     Project scaffolding, Training, Evaluation, W&B integration     |
+--------------------------------------------------------------------+
|                           axonml-tui                              |
|  Interactive terminal dashboard for models, data, training graphs  |
+--------------------------------------------------------------------+
|                        axonml-dashboard                            |
|  Leptos/WASM Web UI: Training, Models, Inference, Settings         |
+--------------------------------------------------------------------+
|                         axonml-server                              |
|  Axum REST API: Auth, Training Runs, Model Registry, Metrics       |
+--------------------------------------------------------------------+
  • Rust 1.75 or later
  • Cargo
  • Node.js (for PM2 process management)
  • Aegis-DB (document store database)
git clone https://github.com/automatanexus/axonml
cd axonml
cargo build --release
cargo install --path crates/axonml-cli

AxonML server is managed via PM2 for automatic restarts and boot persistence.

# First-time setup
cargo build --release -p axonml-server # Build release binary
sudo mkdir -p /var/log/axonml # Create log directory
sudo chown $USER:$USER /var/log/axonml

# Initialize database (creates collections + users)
./AxonML_DB_Init.sh --with-user

# Start with PM2
pm2 start ecosystem.config.js
pm2 save # Save process list
pm2 startup # Enable boot persistence

# Management
pm2 status # Check status
pm2 logs axonml-server # View logs
pm2 restart axonml-server # Restart server
pm2 stop axonml-server # Stop server

AxonML uses Aegis-DB as its document store.

# Initialize database (run once or to reinitialize)
./AxonML_DB_Init.sh # Basic setup with admin user
./AxonML_DB_Init.sh --with-user # Also creates DevOps admin user

# Default Users
# Admin: admin@axonml.local / admin
# DevOps: DevOps@automatanexus.com / Invertedskynet2$
Variable         Default                 Description
RUST_LOG         info                    Log level (trace, debug, info, warn, error)
AEGIS_URL        http://127.0.0.1:7001   Aegis-DB connection URL
RESEND_API_KEY   -                       Email service API key
Axonml/
├── Cargo.toml              # Workspace configuration
├── README.md               # This file
├── LICENSE-MIT             # MIT license
├── LICENSE-APACHE          # Apache 2.0 license
├── CONTRIBUTING.md         # Contribution guidelines
├── CHANGELOG.md            # Version history
├── COMMERCIAL.md           # Commercial licensing info
├── Axonml_Architecture.md # Architecture documentation
├── crates/
│   ├── axonml-core/       # Device, storage, dtypes
│   ├── axonml-tensor/     # Tensor operations
│   ├── axonml-autograd/   # Automatic differentiation
│   ├── axonml-nn/         # Neural network modules
│   ├── axonml-optim/      # Optimizers & schedulers
│   ├── axonml-data/       # Data loading
│   ├── axonml-vision/     # Computer vision
│   ├── axonml-audio/      # Audio processing
│   ├── axonml-text/       # NLP utilities
│   ├── axonml-distributed/# Distributed training
│   ├── axonml-serialize/  # Model serialization
│   ├── axonml-onnx/       # ONNX import/export
│   ├── axonml-quant/      # Model quantization
│   ├── axonml-fusion/     # Kernel fusion optimization
│   ├── axonml-jit/        # JIT compilation
│   ├── axonml-profile/    # Profiling tools
│   ├── axonml-llm/        # LLM architectures (BERT, GPT-2)
│   ├── axonml-cli/        # Command line interface
│   ├── axonml-tui/        # Terminal user interface
│   ├── axonml-dashboard/  # Leptos/WASM web dashboard
│   ├── axonml-server/     # Axum API server
│   └── axonml/            # Main umbrella crate
├── docs/                   # Per-module documentation
└── examples/               # Working examples
    ├── simple_training.rs  # XOR with MLP
    ├── mnist_training.rs   # CNN on MNIST
    └── nlp_audio_test.rs   # Text & audio demo

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

The framework includes 1,988 tests across all crates:

Crate Tests
axonml-core 31
axonml-tensor 98
axonml-autograd 105
axonml-nn 171
axonml-optim 79
axonml-data 55
axonml-vision 607
axonml-audio 37
axonml-text 43
axonml-distributed 83
axonml-serialize 31
axonml-onnx 28
axonml-quant 26
axonml-fusion 31
axonml-jit 27
axonml-profile 27
axonml-llm 73
axonml-server 120
axonml-cli 74 (unit) + 37 (integration)
axonml-tui 14
axonml (umbrella) 25 (unit + integration)

Licensed under either of:

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT License (LICENSE-MIT)

at your option.

Axonml - Forging the future of ML in Rust.




Comments

  • By jacobn 2026-02-28 23:26 · 1 reply

    Cool! How do you actually implement “Reverse-mode automatic differentiation with a tape-based computational graph” in rust?

    • By AutomataNexus 2026-03-01 18:55

      Hi Jacob, AxonML author here. Our autograd is ~3K lines of Rust. Here's the actual architecture:

        Three core pieces:
      
        1. The GradientFunction trait — every differentiable op implements this:
      
        pub trait GradientFunction: Debug + Send + Sync {
            // Given dL/d(output), compute dL/d(each input)
            fn apply(&self, grad_output: &Tensor<f32>) -> Vec<Option<Tensor<f32>>>;
            // Linked list of parent grad functions (the "tape" edges)
            fn next_functions(&self) -> &[Option<GradFn>];
            fn name(&self) -> &'static str;
        }
      
        GradFn is just an Arc<dyn GradientFunction> wrapper — cheap to clone, identity via Arc pointer address.
      
        2. Forward pass builds the graph implicitly. Every op creates a backward node with saved tensors + links to its
        inputs' grad functions:
      
        // Multiplication: d/dx(x*y) = y, d/dy(x*y) = x
        pub struct MulBackward {
            next_fns: Vec<Option<GradFn>>,  // parent grad functions
            saved_lhs: Tensor<f32>,         // saved for backward
            saved_rhs: Tensor<f32>,
        }
      
        impl GradientFunction for MulBackward {
            fn apply(&self, grad_output: &Tensor<f32>) -> Vec<Option<Tensor<f32>>> {
                let grad_lhs = grad_output.mul(&self.saved_rhs).unwrap();
                let grad_rhs = grad_output.mul(&self.saved_lhs).unwrap();
                vec![Some(grad_lhs), Some(grad_rhs)]
            }
            fn next_functions(&self) -> &[Option<GradFn>] { &self.next_fns }
        }
      
        The Variable wrapper connects it:
      
        pub fn mul_var(&self, other: &Variable) -> Variable {
            let result = self.data() * other.data();
            let grad_fn = GradFn::new(MulBackward::new(
                self.grad_fn.clone(),   // link to lhs's grad_fn
                other.grad_fn.clone(),  // link to rhs's grad_fn
                self.data(), other.data(),  // save for backward
            ));
            Variable::from_operation(result, grad_fn, true)
        }
      
        3. Backward pass = DFS topological sort, then reverse walk. This is the whole engine:
      
        pub fn backward(output: &Variable, grad_output: &Tensor<f32>) {
            let grad_fn = output.grad_fn().unwrap();
      
            // Topological sort via post-order DFS
            let mut topo_order = Vec::new();
            let mut visited = HashSet::new();
            build_topo_order(&grad_fn, &mut topo_order, &mut visited);
      
            // Walk in reverse, accumulate gradients
            let mut grads: HashMap<GradFnId, Tensor<f32>> = HashMap::new();
            grads.insert(grad_fn.id(), grad_output.clone());
      
            for node in topo_order.iter().rev() {
                // Clone so we don't hold an immutable borrow of `grads`
                // while mutating it in the accumulation loop below.
                let grad = grads.get(&node.id()).unwrap().clone();
                let input_grads = node.apply(&grad);  // chain rule
      
                for (i, next_fn) in node.next_functions().iter().enumerate() {
                    if let Some(next) = next_fn {
                        if let Some(ig) = &input_grads[i] {
                            grads.entry(next.id())
                                .and_modify(|g| *g = g.add(ig).unwrap())  // accumulate
                                .or_insert(ig.clone());
                        }
                    }
                }
            }
        }
      
        Leaf variables use AccumulateGrad — a special GradientFunction that writes the gradient into the Variable's shared
        Arc<RwLock<Option<Tensor>>> instead of propagating further. That's how x.grad() works after backward.
      
        Key Rust-specific decisions:
      
        - Thread-local graph (thread_local! + HashMap<NodeId, GraphNode>) — no global lock contention, each thread gets its
        own tape
        - Arc<dyn GradientFunction> for the linked-list edges — trait objects give polymorphism, Arc gives cheap cloning and
        stable identity (pointer address = node ID)
        - parking_lot::RwLock over std::sync — faster uncontended reads for the gradient accumulators
        - Graph cleared after backward (like PyTorch's retain_graph=False) — we learned this the hard way when GRU training
        with 120 timesteps leaked ~53GB via accumulated graph nodes
      
        The "tape" isn't really a flat tape — it's a DAG of GradFn nodes linked via next_functions(). The topological sort
        flattens it into an execution order at backward time. This is the same design as PyTorch's C++ autograd engine, just
        in Rust with ownership semantics doing a lot of the memory safety work for free.
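For readers who want to run the idea end to end, the same design distills to a self-contained scalar version: trait-object-style backward nodes, post-order DFS, and reverse accumulation keyed by node address. A simplification for illustration, not the AxonML code:

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Scalar distillation of the tape design: each op records its parents
// and a rule mapping dL/d(output) to dL/d(each input).
struct Node {
    parents: Vec<Rc<Node>>,
    backward: Box<dyn Fn(f32) -> Vec<f32>>,
}

struct Var {
    value: f32,
    node: Rc<Node>,
}

fn leaf(value: f32) -> Var {
    Var { value, node: Rc::new(Node { parents: vec![], backward: Box::new(|_| vec![]) }) }
}

fn mul(a: &Var, b: &Var) -> Var {
    let (av, bv) = (a.value, b.value);
    Var {
        value: av * bv,
        node: Rc::new(Node {
            parents: vec![Rc::clone(&a.node), Rc::clone(&b.node)],
            // d(a*b)/da = b, d(a*b)/db = a
            backward: Box::new(move |g| vec![g * bv, g * av]),
        }),
    }
}

fn add(a: &Var, b: &Var) -> Var {
    Var {
        value: a.value + b.value,
        node: Rc::new(Node {
            parents: vec![Rc::clone(&a.node), Rc::clone(&b.node)],
            backward: Box::new(|g| vec![g, g]),
        }),
    }
}

/// Post-order DFS then reverse walk, accumulating gradients into a map
/// keyed by node address (pointer identity, like GradFnId above).
fn backward(output: &Var) -> HashMap<usize, f32> {
    fn topo(n: &Rc<Node>, order: &mut Vec<Rc<Node>>, seen: &mut Vec<usize>) {
        let id = Rc::as_ptr(n) as usize;
        if seen.contains(&id) { return; }
        seen.push(id);
        for p in &n.parents { topo(p, order, seen); }
        order.push(Rc::clone(n));
    }
    let mut order = Vec::new();
    topo(&output.node, &mut order, &mut Vec::new());

    let mut grads: HashMap<usize, f32> = HashMap::new();
    grads.insert(Rc::as_ptr(&output.node) as usize, 1.0); // dL/dL = 1
    for n in order.iter().rev() {
        let g = *grads.get(&(Rc::as_ptr(n) as usize)).unwrap_or(&0.0);
        for (p, pg) in n.parents.iter().zip((n.backward)(g)) {
            *grads.entry(Rc::as_ptr(p) as usize).or_insert(0.0) += pg; // accumulate
        }
    }
    grads
}

fn main() {
    // L = x*y + x  =>  dL/dx = y + 1, dL/dy = x
    let (x, y) = (leaf(3.0), leaf(4.0));
    let l = add(&mul(&x, &y), &x);
    let grads = backward(&l);
    assert_eq!(grads[&(Rc::as_ptr(&x.node) as usize)], 5.0);
    assert_eq!(grads[&(Rc::as_ptr(&y.node) as usize)], 3.0);
    println!("ok");
}
```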

  • By AutomataNexus 2026-02-28 23:17

    Hi HN. I've been building AxonML for a while, testing as I go, and it's at v0.3.3 now -- 22 crates, 336 Rust source files, 1,076+ passing tests. It's a from-scratch ML framework in pure Rust aiming for PyTorch parity, dual-licensed MIT/Apache-2.0.

    I'm sharing it because I think the "Rust for ML" space is still underexplored relative to its potential, and I wanted to show what one person building full-time can produce.

    ### What's built

    The full stack, bottom to top:

    *Core compute:* N-dimensional tensors with broadcasting (NumPy rules), arbitrary shapes, views, slicing. Reverse-mode automatic differentiation with a tape-based computational graph. GPU backends for CUDA (GPU-resident tensors, cuBLAS GEMM, 20+ element-wise kernels with automatic dispatch), Vulkan, Metal, and WebGPU.
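    The NumPy broadcasting rule is: align shapes from the trailing dimension, and two dims are compatible if they're equal or one of them is 1. A minimal sketch of that shape-resolution logic (not AxonML's actual implementation):

```rust
// Resolve the broadcast result shape per NumPy rules, or None if incompatible.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        // Read dims right-aligned; missing leading dims count as 1.
        let da = *a.iter().rev().nth(i).unwrap_or(&1);
        let db = *b.iter().rev().nth(i).unwrap_or(&1);
        match (da, db) {
            (x, y) if x == y => out.push(x),
            (1, y) => out.push(y),
            (x, 1) => out.push(x),
            _ => return None, // incompatible shapes
        }
    }
    out.reverse();
    Some(out)
}

fn main() {
    // (3, 1, 5) broadcast with (4, 5) -> (3, 4, 5)
    assert_eq!(broadcast_shape(&[3, 1, 5], &[4, 5]), Some(vec![3, 4, 5]));
    // (2, 3) with (3, 2) has no compatible alignment
    assert_eq!(broadcast_shape(&[2, 3], &[3, 2]), None);
}
```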

    *Neural networks:* Linear, Conv1d/2d, MaxPool, AvgPool, AdaptiveAvgPool, BatchNorm1d/2d, LayerNorm, GroupNorm, InstanceNorm2d, Dropout, RNN/LSTM/GRU (with cell variants), MultiHeadAttention, CrossAttention, full Transformer encoder/decoder, Seq2SeqTransformer, Embedding. Loss functions: MSE, CrossEntropy, BCE, BCEWithLogits, L1, SmoothL1, NLL. Initialization: Xavier, Kaiming, Orthogonal.

    *Optimizers:* SGD (with momentum/Nesterov), Adam, AdamW, RMSprop, Adagrad, LBFGS, LAMB. GradScaler for mixed precision. LR schedulers: Step, Cosine, OneCycle, Warmup, ReduceLROnPlateau, MultiStep, Exponential.
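    As an example of what a scheduler from that list computes, here is the cosine-annealing formula on its own (the math, not AxonML's scheduler API):

```rust
use std::f64::consts::PI;

// Cosine annealing: decay from lr_max to lr_min over total_steps.
fn cosine_lr(step: usize, total_steps: usize, lr_max: f64, lr_min: f64) -> f64 {
    let t = step as f64 / total_steps as f64;
    lr_min + 0.5 * (lr_max - lr_min) * (1.0 + (PI * t).cos())
}

fn main() {
    let (max, min) = (1e-3, 1e-5);
    assert!((cosine_lr(0, 100, max, min) - max).abs() < 1e-12);   // starts at lr_max
    assert!((cosine_lr(100, 100, max, min) - min).abs() < 1e-12); // ends at lr_min
    let mid = cosine_lr(50, 100, max, min);
    assert!((mid - 0.5 * (max + min)).abs() < 1e-12);             // halfway point
}
```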

    *Distributed training:* DDP, Fully Sharded Data Parallel (ZeRO-2/ZeRO-3), Pipeline Parallelism with microbatching, Tensor Parallelism.

    *LLM architectures:* BERT (encoder, sequence classification, masked LM), GPT-2 (decoder, LM head), LLaMA (RMSNorm, RotaryEmbedding, GroupedQueryAttention), Mistral, Phi. Text generation with top-k, top-p, temperature sampling. Pretrained model hub configs.

    *Ecosystem tooling:* ONNX import/export (40+ operators, opset 17), model quantization (INT4/INT5/INT8/F16, block-based with calibration, ~8x size reduction at Q4), kernel fusion (automatic pattern detection, FusedLinear, up to 2x on memory-bound ops), JIT compilation (graph optimization, Cranelift foundation), profiling (timeline with Chrome trace export, bottleneck analyzer).
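    To illustrate the block-based quantization idea, here is a symmetric INT4 sketch (AxonML's actual on-disk format isn't shown; a 32-element block and a per-block f32 scale are assumptions):

```rust
// Symmetric INT4 quantization of one block: store a scale plus 4-bit codes.
fn quantize_block(block: &[f32]) -> (f32, Vec<i8>) {
    let amax = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 7.0 }; // INT4 range: -8..=7
    let q = block
        .iter()
        .map(|&x| (x / scale).round().clamp(-8.0, 7.0) as i8)
        .collect();
    (scale, q)
}

fn dequantize_block(scale: f32, q: &[i8]) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let block: Vec<f32> = (0..32).map(|i| (i as f32 - 16.0) / 4.0).collect();
    let (scale, q) = quantize_block(&block);
    let deq = dequantize_block(scale, &q);
    // Reconstruction error is bounded by half a quantization step.
    let max_err = block
        .iter()
        .zip(&deq)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    assert!(max_err <= scale * 0.5 + 1e-6);
    // Size arithmetic: 32 f32 weights (128 bytes) become 16 packed bytes of
    // codes plus a 4-byte scale, ~6.4x smaller; with f16 scales this
    // approaches the ~8x figure quoted above.
}
```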

    *Vision/Audio/NLP:* ResNet, VGG, ViT architectures, image transforms, MFCC/spectrogram, BPE tokenizer, vocabulary management.

    *Full application stack:* CLI with 50+ commands, terminal UI (ratatui-based dashboard), web dashboard (Leptos/WASM with WebSocket), Axum REST API server with JWT auth, MFA (TOTP + WebAuthn), model registry, inference endpoint deployment, in-browser terminal via WebSocket PTY, Prometheus metrics, Weights & Biases integration, Kaggle integration.

    I estimate PyTorch parity at roughly 92-95% for the core training loop and standard layer types.

    ### Production deployment -- this is the part I'm most proud of

    AxonML is running live production inference right now. 12 HVAC predictive maintenance models (LSTM autoencoders for anomaly detection + GRU failure predictors) are deployed across 6 Raspberry Pi edge controllers, monitoring commercial building equipment across 5 facilities. Each model is cross-compiled to `armv7-unknown-linux-musleabihf` (static musl), runs as a PM2-managed daemon at ~2-3 MB RSS, and exposes predictions via REST API at 1 Hz.

    Beyond those initial 6 controllers, I've built out models for 35 HVAC areas across 7 facilities (FCOG, Warren, Huntington, Akron, Hopebridge, NE Realty, and a unified NexusBMS system with 22 trained models covering air handlers, boilers, chillers, VAVs, fan coils, make-up air units, DOAS units, pumps, and steam systems). 69 `.axonml` model files total.

    The deployment pipeline: AxonML training on CPU --> `.axonml` serialized weights --> cross-compiled ARM inference binary (pure tensor ops, no autograd overhead) --> PM2 process management on the Pi --> HTTP endpoints for integration with the building management system.
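    Sketched as shell commands (binary name, host, and paths are placeholders; depending on your setup the musl target may also need a cross linker such as `arm-linux-musleabihf-gcc` configured in `.cargo/config.toml`):

```shell
# One-time: install the static ARMv7 musl target
rustup target add armv7-unknown-linux-musleabihf

# Build the statically-linked inference binary (no Python runtime needed)
cargo build --release --target armv7-unknown-linux-musleabihf

# Ship it to the Pi (placeholder binary name and host)
scp target/armv7-unknown-linux-musleabihf/release/hvac-infer pi@controller:/opt/hvac/

# Run it as a PM2-managed daemon exposing the REST endpoint
ssh pi@controller 'pm2 start /opt/hvac/hvac-infer --name hvac-infer && pm2 save'
```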

    This is the use case that drove most of the framework's development. The models needed to be small, fast, and run on constrained hardware without Python.

    ### Kaggle competition usage

    I'm also using AxonML for the Deep Past Initiative Kaggle competition -- machine translation from Akkadian cuneiform to English. Full seq2seq Transformer (encoder-decoder with multi-head attention, sinusoidal positional encoding, BPE tokenization) trained on ~1,561 parallel sentence pairs. It compiles and trains end-to-end through AxonML. Evaluated on BLEU + chrF++.
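    The sinusoidal positional encoding mentioned there is the standard formula from "Attention Is All You Need" -- PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d)). A standalone sketch (not AxonML's API):

```rust
// Build a (max_len x d_model) sinusoidal positional encoding table.
fn positional_encoding(max_len: usize, d_model: usize) -> Vec<Vec<f32>> {
    (0..max_len)
        .map(|pos| {
            (0..d_model)
                .map(|j| {
                    let i = (j / 2) as f32; // paired sin/cos share a frequency
                    let angle = pos as f32 / 10000f32.powf(2.0 * i / d_model as f32);
                    if j % 2 == 0 { angle.sin() } else { angle.cos() }
                })
                .collect()
        })
        .collect()
}

fn main() {
    let pe = positional_encoding(4, 8);
    // Position 0 alternates sin(0)=0 and cos(0)=1 across the embedding dim.
    assert_eq!(pe[0], vec![0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]);
    // The lowest-index dim of position 1 is sin(1).
    assert!((pe[1][0] - 1f32.sin()).abs() < 1e-6);
}
```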

    ### Honest limitations

    - *Ecosystem maturity.* PyTorch has thousands of contributors, Hugging Face, torchvision's pretrained zoo, a decade of Stack Overflow answers. AxonML has one developer and a growing but small set of pretrained weights. If you need a specific pretrained model, you'll probably need to convert it yourself via ONNX.
    - *GPU kernel coverage.* CUDA support works -- cuBLAS GEMM, 20+ element-wise kernels, GPU-resident tensors -- but the coverage is nowhere near cuDNN-backed PyTorch. Some operations will fall back to CPU. Vulkan/Metal/WebGPU backends are implemented but less battle-tested than CUDA.
    - *Python interop doesn't exist.* If your workflow depends on pandas, scikit-learn preprocessing, or Jupyter notebooks, you'll need to handle data prep separately. This is a Rust-native framework.

    ### Why Rust for ML?

    Three reasons from practical experience:

    1. *Single-binary deployment.* `cargo build --release --target armv7-unknown-linux-musleabihf` gives you a statically-linked inference binary. No Python runtime, no pip, no conda, no Docker. Copy it to a Raspberry Pi and it runs. This is why my HVAC models actually work in production.
    2. *Compile-time safety.* Dimension mismatches, type errors, and lifetime issues are caught before you start a training run, not 3 hours into one.
    3. *Memory predictability.* No GC pauses, no reference counting overhead on the hot path, deterministic memory layout. On a Raspberry Pi with 1 GB RAM running at 2-3 MB RSS, this matters.

    GitHub: https://github.com/AutomataNexus/AxonML

    Happy to answer questions about the architecture, the borrow-checker-vs-autograd challenges, the edge deployment pipeline, or the Kaggle experience.
