
Cumulus Labs Ion Attention Engine

Our custom inference stack multiplexes models on a single GPU, swaps them in milliseconds, and adapts to traffic in real time. Built from the ground up for Grace Hopper.
Deploy your finetunes, custom LoRAs, or any open-source model on our fleet. Dedicated GPU streams with no cold starts and per-second billing.
Teams use Ion for high-performance robotics perception, multi-camera surveillance, game asset generation, and AI video pipelines.
5 VLMs, 1 GPU.
Five vision-language models on a single GPU — 2,700 video clips, concurrent users, <1s cold starts.
Read the case study
Point your existing OpenAI client at Ion. Any language, any framework. One line change.
Pay per million tokens. No idle costs.
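As an illustration, that one-line change with the OpenAI Python SDK looks roughly like this; the base URL and model id below are placeholders, not values confirmed on this page:

```python
from openai import OpenAI

# Sketch of the "one line change": point the standard OpenAI client
# at an OpenAI-compatible endpoint instead of api.openai.com.
# Base URL and model id are illustrative placeholders.
client = OpenAI(
    base_url="https://api.ion.example/v1",  # hypothetical Ion endpoint
    api_key="YOUR_ION_API_KEY",
)

response = client.chat.completions.create(
    model="glm-5",  # placeholder model id
    messages=[{"role": "user", "content": "Hello from Ion!"}],
)
print(response.choices[0].message.content)
```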
Zhipu AI's flagship 600B+ MoE model with state-of-the-art reasoning, coding, and multilingual capabilities, powered by EAGLE speculative decoding on 8x B200 GPUs.
Moonshot AI's frontier reasoning model designed for long document understanding, multi-step reasoning chains, and complex problem decomposition across technical and scientific domains.
MiniMax's flagship 1M-context language model delivering strong reasoning and instruction following across long documents, multi-turn dialogue, and complex analysis.
Cumulus's most capable open-source model — a 122B MoE with 10B active parameters rivaling leading proprietary models on coding, reasoning, and multilingual benchmarks.
A frontier open-source 120B model delivering cutting-edge reasoning and instruction following comparable to leading closed-source systems, ideal for complex agentic workflows and advanced code generation.
A 14B text-to-video model optimized for speed via the FastGen runtime, generating clips in under 10 seconds with strong motion coherence.
Black Forest Labs' fastest Flux model, delivering crisp sub-4-second image generation ideal for real-time applications, prototyping, and high-volume pipelines.
As an inference-hungry human, I am obviously hooked. Quick feedback:
1. The models/pricing page should perhaps be linked from the top, since that's the most interesting part for most users. You mention some impressive numbers (e.g. GLM5 at ~220 tok/s, $1.20 in · $3.50 out), but those are far down the page and many people will miss them.
2. When looking for inference, I always look at three things: which models are supported, at which quantization, and what the cached-input pricing is (this matters way more than headline pricing for agentic loops; rough sketch below). The site covers the first but not the second or third. Would definitely like to know!
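To make that point concrete, here's a rough back-of-envelope comparison for a multi-step agent loop; the cache discount, token counts, and hit rate are assumptions for illustration, not Ion's actual rates:

```python
# Why cached-input pricing dominates in agentic loops.
# Prices are hypothetical (uncached input based on the ~$1.20/M
# figure mentioned above; 10x cache discount is an assumption).
PRICE_IN = 1.20 / 1_000_000         # $/token, uncached input
PRICE_IN_CACHED = 0.12 / 1_000_000  # $/token, assumed 10x cache discount
PRICE_OUT = 3.50 / 1_000_000        # $/token, output

# A 50-step agent loop re-sends a growing transcript each step,
# so ~90% of input tokens are repeats of prior context.
steps, input_per_step, output_per_step = 50, 20_000, 500
cache_hit_rate = 0.9

uncached = steps * (input_per_step * PRICE_IN + output_per_step * PRICE_OUT)
cached = steps * (
    input_per_step * (1 - cache_hit_rate) * PRICE_IN
    + input_per_step * cache_hit_rate * PRICE_IN_CACHED
    + output_per_step * PRICE_OUT
)
print(f"no cache pricing: ${uncached:.2f}  with cache pricing: ${cached:.2f}")
# -> no cache pricing: $1.29  with cache pricing: $0.32
```

Under these assumptions the loop costs roughly 4x less with cached-input pricing, which is why it matters more than the headline rate for agentic workloads.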
Thank you for the feedback! We will definitely redo the front page to reorganize the info and make quantizations more visible. For reference, Kimi and Minimax are NVFP4; the rest are FP8. But I will make this more obvious on the site itself.
I love the phrase "inference hunger"
Since you're using GH200s for these optimizations, you're restricted to single-device workloads (the GH series is an SoC architecture). Kimi K2 (and many other large MoE models) requires multiple devices. Does that mean you can't scale these optimizations to multi-device workloads?
Hey Jack, we use GB200s for these workloads. Feel free to check those big models out on our site! We are doing Kimi, GLM, Minimax, etc.
Are you `Ionstream` on OpenRouter?
If so, it would be great to provide more models through OpenRouter. This looks interesting, but not enough to make me go through the trouble of setting up a separate account, funding it, etc.
second that.
for smaller startups, it's easier to go through one provider (OpenRouter) than to manage different endpoints and accounts. you might get access to many more users that way.
mid-to-large companies might want to go directly to the source (you) to really optimize the last mile, but even that is debatable for many.
Hey @nnx & @hazelnut, good question, but no, we're not IonStream on OpenRouter.
The purpose of IonRouter is to let people publicly see the speed of our engine firsthand. It makes the sales pipeline a lot easier when a prospect can just go try it themselves before committing. Signup is low friction ($10 minimum to load, and we preload $0.10) so you can test right away.
That said, we do plan to offer this as a usage-based service within our own cloud. We own every layer of the stack: inference engine, GPU orchestration, scheduling, routing, billing, all of it. No third-party inference runtime, no off-the-shelf serving framework. So there's no reason for us to go through a middleman.
No plans to be an OpenRouter provider right now.