Launch HN: IonRouter (YC W26) – High-throughput, low-cost inference

2026-03-12 18:52 · 6934 · ionrouter.io

Cumulus Labs · IonAttention Engine

NVIDIA Grace Hopper Superchip

Our custom inference stack multiplexes models on a single GPU, swaps them in milliseconds, and adapts to traffic in real time. Built from the ground up for Grace Hopper.

Throughput (tok/s) · Single GH200, Qwen2.5-7B
Top inference provider: ~3,000
Read the deep dive
Custom Models

Deploy your finetunes, custom LoRAs, or any open-source model on our fleet. Dedicated GPU streams with no cold starts and per-second billing.

Book a call
What Teams Build on Ion

Teams use Ion for high-performance robotics perception, multi-camera surveillance, game-asset generation, and AI video pipelines.

5 VLMs, 1 GPU.

Five vision-language models on a single GPU: 2,700 video clips, concurrent users, and sub-second cold starts.

Read the case study
API · Zero Code Changes

Point your existing OpenAI client at Ion. Any language, any framework. One line change.

Pay per million tokens. No idle costs.
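As a sketch of what that looks like in practice: the endpoint URL and environment-variable name below are illustrative assumptions, not documented values (check IonRouter's docs for the real ones), and per-million-token billing makes request cost a one-line calculation:

```python
# Hypothetical base URL and key variable, for illustration only.
# With the official OpenAI Python client, the switch would be:
#
#   from openai import OpenAI
#   client = OpenAI(
#       base_url="https://api.ionrouter.io/v1",   # the one-line change
#       api_key=os.environ["ION_API_KEY"],
#   )
#
# "Pay per million tokens" means a request's cost is simply:

def request_cost_usd(in_tokens: int, out_tokens: int,
                     price_in: float, price_out: float) -> float:
    """Cost of one request, given $/1M-token input and output prices."""
    return in_tokens / 1e6 * price_in + out_tokens / 1e6 * price_out

# e.g. 10k prompt + 2k completion tokens at $1.20 in / $3.50 out:
cost = request_cost_usd(10_000, 2_000, 1.20, 3.50)   # ~$0.019
```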

GLM5 · Language

Zhipu AI's flagship 600B+ MoE model with state-of-the-art reasoning, coding, and multilingual capabilities, powered by EAGLE speculative decoding on 8x B200 GPUs.

~220 tok/s · $1.20 in · $3.50 out

Moonshot AI's frontier reasoning model, designed for long-document understanding, multi-step reasoning chains, and complex problem decomposition across technical and scientific domains.

~120 tok/s · $0.20 in · $1.60 out
Try in Playground

MiniMax's flagship 1M-context language model delivering strong reasoning and instruction following across long documents, multi-turn dialogue, and complex analysis.

~120 tok/s · $0.40 in · $1.50 out
Try in Playground
Qwen3.5-122B-A10B · Language

Cumulus's most capable open-source model — a 122B MoE with 10B active parameters rivaling leading proprietary models on coding, reasoning, and multilingual benchmarks.

~120 tok/s · $0.20 in · $1.60 out
Try in Playground

A frontier open-source 120B model delivering cutting-edge reasoning and instruction following comparable to leading closed-source systems, ideal for complex agentic workflows and advanced code generation.

~100 tok/s · $0.020 in · $0.095 out
Try in Playground
Wan2.2 Text-to-Video · Video

A 14B text-to-video model optimized for speed via the FastGen runtime, generating clips in under 10 seconds with strong motion coherence.

~8s/clip · $0.00194 / GPU·sec
Try in Playground
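Per-GPU-second billing makes clip cost a simple product of generation time and rate. A back-of-envelope check, assuming one GPU per job (the listing doesn't state the GPU count):

```python
# Per-GPU-second billing: clip cost = generation time x rate.
# Single-GPU-per-job is an assumption here, not stated on the page.
gpu_seconds = 8              # ~8 s per clip, as listed above
rate_usd = 0.00194           # $ per GPU-second
clip_cost = gpu_seconds * rate_usd   # ~ $0.0155 per clip
```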

Black Forest Labs' fastest Flux model, delivering crisp sub-4-second image generation ideal for real-time applications, prototyping, and high-volume pipelines.

~3s/image · ~$0.005 / image
Try in Playground


Read the original article

Comments

  • By GodelNumbering 2026-03-12 19:32 · 2 replies

    As an inference-hungry human, I am obviously hooked. Quick feedback:

    1. The models/pricing page should perhaps be linked from the top, as that is the most interesting part for most users. You mention some impressive numbers (e.g. GLM5 at ~220 tok/s, $1.20 in · $3.50 out), but those are far down the page and many would miss them

    2. When looking for inference, I always look at three things: which models are supported, at which quantization, and what the cached-input pricing is (this is way more important than headline pricing for agentic loops). You have info about the first on the site, but not the second or third. Would definitely like to know!

    • By 2uryaa 2026-03-12 21:00

      Thank you for the feedback! We will definitely reorganize the frontpage info to show quantizations more clearly. For reference, Kimi and Minimax are NVFP4; the rest are FP8. But I will make this more obvious on the site itself.

    • By bethekind 2026-03-13 0:05

      I love the phrase "inference hunger"

  • By jakestevens2 2026-03-14 0:00 · 1 reply

    Since you're using GH200s for these optimizations, you're restricted to single-device workloads (the GH series is an SoC architecture). Kimi K2 (and many other large MoE models) requires multiple devices. Does that mean you can't scale these optimizations to multi-device workloads?

    • By 2uryaa 2026-03-14 0:54

      Hey Jake, we use GB200s for these workloads. Feel free to check those big models out on our site! We are doing Kimi, GLM, Minimax, etc.

  • By nnx 2026-03-13 2:11 · 1 reply

    Are you `Ionstream` on OpenRouter?

    If so, it would be great to provide more models through OpenRouter. This looks interesting but not enough to make me go through the trouble of setting up a separate account, funding it, etc.

    • By hazelnut 2026-03-13 3:41 · 1 reply

      second that.

      for smaller startups, it's easier to go through one provider (OpenRouter) instead of having the hassle of managing different endpoints and accounts. you might get access to many more users that way.

      mid-to-large companies might want to go directly to the source (you) if they want to really optimize the last mile, but even that is debatable for many.

      • By vshah1016 2026-03-13 18:39

        Hey @nnx & @hazelnut, good question, but no, we're not IonStream on OpenRouter.

        The purpose of IonRouter is to let people publicly see the speed of our engine firsthand. It makes the sales pipeline a lot easier when a prospect can just go try it themselves before committing. Signup is low friction ($10 minimum to load, and we preload $0.10) so you can test right away.

        That said, we do plan to offer this as a usage-based service within our own cloud. We own every layer of the stack: inference engine, GPU orchestration, scheduling, routing, billing, all of it. No third-party inference runtime, no off-the-shelf serving framework. So there's no reason for us to go through a middleman.

        No plans to be an OpenRouter provider right now.

HackerNews