Speed up responses with fast mode

2026-02-07 18:08 · code.claude.com

Get faster Opus 4.6 responses in Claude Code by toggling fast mode.

Fast mode is in research preview. The feature, pricing, and availability may change based on feedback.

Fast mode delivers faster Opus 4.6 responses at a higher cost per token. Toggle it on with /fast when you need speed for interactive work like rapid iteration or live debugging, and toggle it off when cost matters more than latency.

Fast mode is not a different model. It uses the same Opus 4.6 with a different API configuration that prioritizes speed over cost efficiency. You get identical quality and capabilities, just faster responses.

What to know:
  • Use /fast to toggle fast mode in the Claude Code CLI; the same command works in the Claude Code VS Code extension.
  • Fast mode pricing for Opus 4.6 starts at $30 per MTok input and $150 per MTok output. Fast mode is available at a 50% discount for all plans until 11:59pm PT on February 16.
  • Available to all Claude Code users on subscription plans (Pro/Max/Team/Enterprise) and Claude Console.
  • For Claude Code users on subscription plans (Pro/Max/Team/Enterprise), fast mode is available via extra usage only and not included in the subscription rate limits.
This page covers how to toggle fast mode, its cost tradeoff, when to use it, requirements, and rate limit behavior.

Toggle fast mode in either of these ways:
  • Type /fast and press Tab to toggle on or off
  • Set "fastMode": true in your user settings file
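For the settings-file route, a minimal sketch of what the setting looks like (assuming the user settings file lives at `~/.claude/settings.json`, Claude Code's default location):

```json
{
  "fastMode": true
}
```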
Fast mode persists across sessions. For the best cost efficiency, enable fast mode at the start of a session rather than switching mid-conversation; see the cost tradeoff discussion below for details.

When you enable fast mode:
  • If you’re on a different model, Claude Code automatically switches to Opus 4.6
  • You’ll see a confirmation message: “Fast mode ON”
  • A small icon appears next to the prompt while fast mode is active
  • Run /fast again at any time to check whether fast mode is on or off
When you disable fast mode with /fast again, you remain on Opus 4.6; the model does not revert to your previous model. To switch to a different model, use /model.

Fast mode has higher per-token pricing than standard Opus 4.6:
Mode | Input (MTok) | Output (MTok)
Fast mode on Opus 4.6 (<200K) | $30 | $150
Fast mode on Opus 4.6 (>200K) | $60 | $225
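As a rough illustration of the tradeoff, here is a per-turn cost comparison. The fast-mode prices come from the table above; the standard Opus 4.6 prices ($5/$25 per MTok) are an assumption inferred from the "6x the price" figure quoted in the comments below, not an official number.

```python
# Illustrative cost comparison: fast mode vs standard Opus 4.6.
# Fast-mode rates are from the pricing table above (<200K context);
# standard rates are an assumption based on the "6x" figure.

FAST = {"input": 30.0, "output": 150.0}    # $/MTok, fast mode
STANDARD = {"input": 5.0, "output": 25.0}  # $/MTok, assumed standard rate

def cost(input_tok: int, output_tok: int, prices: dict) -> float:
    """Dollar cost of one turn, given per-million-token prices."""
    return (input_tok * prices["input"] + output_tok * prices["output"]) / 1_000_000

# One turn with 50K tokens of conversation context and 2K output tokens:
print(f"fast:     ${cost(50_000, 2_000, FAST):.2f}")      # $1.50 + $0.30
print(f"standard: ${cost(50_000, 2_000, STANDARD):.2f}")  # $0.25 + $0.05
```

Because input dominates as the conversation grows, the gap widens with context length, which is why enabling fast mode from the start is cheaper than switching mid-conversation and repaying the whole context at the fast-mode input rate.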
Fast mode is compatible with the 1M token extended context window.

When you switch into fast mode mid-conversation, you pay the full fast mode uncached input token price for the entire conversation context. This costs more than enabling fast mode from the start.

Fast mode is best for interactive work where response latency matters more than cost:
  • Rapid iteration on code changes
  • Live debugging sessions
  • Time-sensitive work with tight deadlines
Standard mode is better for:
  • Long autonomous tasks where speed matters less
  • Batch processing or CI/CD pipelines
  • Cost-sensitive workloads
Fast mode and effort level both affect response speed, but differently:
Setting | Effect
Fast mode | Same model quality, lower latency, higher cost
Lower effort level | Less thinking time, faster responses, potentially lower quality on complex tasks
You can combine both: use fast mode with a lower effort level for maximum speed on straightforward tasks.

Fast mode requires all of the following:
  • Not available on third-party cloud providers: fast mode is not available on Amazon Bedrock, Google Vertex AI, or Microsoft Azure Foundry. Fast mode is available through the Anthropic Console API and for Claude subscription plans using extra usage.
  • Extra usage enabled: your account must have extra usage enabled, which allows billing beyond your plan’s included usage. For individual accounts, enable this in your Console billing settings. For Teams and Enterprise, an admin must enable extra usage for the organization.

Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.

  • Admin enablement for Teams and Enterprise: fast mode is disabled by default for Teams and Enterprise organizations. An admin must explicitly enable fast mode before users can access it.

If your admin has not enabled fast mode for your organization, the /fast command will show “Fast mode has been disabled by your organization.”

Fast mode has separate rate limits from standard Opus 4.6. When you hit the fast mode rate limit or run out of extra usage credits:
  1. Fast mode automatically falls back to standard Opus 4.6
  2. The icon turns gray to indicate cooldown
  3. You continue working at standard speed and pricing
  4. When the cooldown expires, fast mode automatically re-enables
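The fallback sequence above can be sketched as a tiny state machine. This is illustrative only, not Claude Code's actual implementation; all names here are invented:

```python
# Sketch of the documented rate-limit fallback behavior (hypothetical names).

class FastMode:
    def __init__(self) -> None:
        self.enabled = True    # user toggled /fast on
        self.cooldown = False  # set when rate-limited or out of extra usage

    def on_rate_limited(self) -> None:
        # Steps 1-2: fall back to standard Opus 4.6; the icon turns gray.
        self.cooldown = True

    def on_cooldown_expired(self) -> None:
        # Step 4: fast mode automatically re-enables.
        self.cooldown = False

    def effective_mode(self) -> str:
        # Step 3: during cooldown you keep working at standard
        # speed and pricing; your /fast toggle is untouched.
        return "fast" if self.enabled and not self.cooldown else "standard"
```

The key point the sketch captures is that the cooldown is orthogonal to the user's toggle: hitting the limit never flips `/fast` off, so speed comes back automatically once the cooldown expires.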
To disable fast mode manually instead of waiting for cooldown, run /fast again.

Fast mode is a research preview feature. This means:
  • The feature may change based on feedback
  • Availability and pricing are subject to change
  • The underlying API configuration may evolve
Report issues or feedback through your usual Anthropic support channels.


Comments

  • By zhyder 2026-02-07 23:25, 3 replies

    So 2.5x the speed at 6x the price [1].

    Quite a premium for speed. Especially when Gemini 3 Pro is 1.8x the tokens/sec speed (of regular-speed Opus 4.6) at 0.45x the price [2]. Though it's worse at coding, and Gemini CLI doesn't have the agentic strength of Claude Code, yet.

    [1] - https://x.com/claudeai/status/2020207322124132504 [2] - https://artificialanalysis.ai/leaderboards/models

    • By gpm 2026-02-08 2:15, 1 reply

      6x price/token, so 15x price/second, and only at the API pricing level, not the far cheaper (per token) subscription pricing.

      Definitely an interesting way to encourage whales to spend a lot of money quickly.

      • By atonse 2026-02-08 5:33, 1 reply

        I didn’t quite understand why they were randomly giving people $50 in credits. But I think this is why?

        • By stingraycharles 2026-02-08 9:13, 1 reply

          no, it’s for Max subscribers to enable “use API when running out of session limit”. the assumption (probably) being that many will forget to turn it off, and they’ll earn it back that way.

          • By Loic 2026-02-08 11:38

            This was my first thought, but by default, you have no automatic reload of your prepaid account. Which I think is for once user friendly. They could have applied a dark pattern here.

    • By sebmellen 2026-02-08 1:55

      Gemini is pretty good for frontend tasks

    • By deaux 2026-02-08 3:26, 1 reply

      > Though it's worse at coding, and Gemini CLI doesn't have the agentic strength of Claude Code, yet.

      You can use OpenCode instead of Gemini CLI.

  • By OtherShrezzing 2026-02-07 23:52, 5 replies

    A useful feature would be slow-mode which gets low cost compute on spot pricing.

    I’ll often kick off a process at the end of my day, or over lunch. I don’t need it to run immediately. I’d be fine if it just ran on their next otherwise-idle GPU at much lower cost than the standard offering.

    • By spondyl 2026-02-08 2:04, 1 reply

      https://platform.claude.com/docs/en/build-with-claude/batch-...

      > The Batches API offers significant cost savings. All usage is charged at 50% of the standard API prices.

      • By jaytaylor 2026-02-08 2:37, 1 reply

        Can this work for Claude? I think it might be raw API only.

        • By spondyl 2026-02-08 4:27, 1 reply

          I'm not sure I understand the question? Are you perhaps asking if messages can be batched via Claude Code and/or the Claude web UI?

    • By stavros 2026-02-08 0:26, 1 reply

      OpenAI offers that, or at least used to. You can batch all your inference and get much lower prices.

      • By airspresso 2026-02-08 10:31

        Still do. Great for workloads where it's okay to bundle a bunch of requests and wait some hours (up to 24h, usually done faster) for all of them to complete.

    • By mrklol 2026-02-08 9:51

      Yep same, I often wonder why this isn’t a thing yet. Running some tasks in the night at e.g. 50% of the cost - there’s the batch API but that is not integrated in e.g. Claude Code

    • By gardnr 2026-02-08 8:27

      The discount MAX plans are already on slow-mode.

    • By guerrilla 2026-02-08 0:03, 3 replies

      > I’ll often kick off a process at the end of my day, or over lunch. I don’t need it to run immediately. I’d be fine if it just ran on their next otherwise-idle GPU at much lower cost than the standard offering.

      If it's not time sensitive, why not just run it on CPU/RAM rather than GPU?

      • By weird-eye-issue 2026-02-08 0:42, 1 reply

        Yeah just run an LLM with over 100 billion parameters on a CPU.

        • By kristjansson 2026-02-08 0:50, 1 reply

          200 GB is an unfathomable amount of main memory for a CPU

          (with apologies for snark,) give gpt-oss-120b a try. It’s not fast at all, but it can generate on CPU.

          • By awestroke 2026-02-08 7:27, 1 reply

            But it's incredibly incapable compared to SOTA models. OP wants high quality output but doesn't need it fast. Your suggestion would mean slow AND low quality output.

            • By kristjansson 2026-02-08 16:13

              Set your parameters to make that point then. “Yeah just run a 1T+ model on CPU”

      • By bethekidyouwant 2026-02-08 0:12, 1 reply

        Run what exactly?

        • By all2 2026-02-08 0:39, 2 replies

          I'm assuming GP means 'run inference locally on GPU or RAM'. You can run really big LLMs on local infra, they just do a fraction of a token per second, so it might take all night to get a paragraph or two of text. Mix in things like thinking and tool calls, and it will take a long, long time to get anything useful out of it.

          • By hxtk 2026-02-08 7:30

            I’ve been experimenting with this today. I still don’t think AI is a very good use of my programming time… but it’s a pretty good use of my non-programming time.

            I ran OpenCode with some 30B local models today and it got some useful stuff done while I was doing my budget, folding laundry, etc.

            It’s less likely to “one shot” apples to apples compared to the big cloud models; Gemini 3 Pro can one shot reasonably complex coding problems through the chat interface. But through the agent interface where it can run tests, linters, etc. it does a pretty good job for the size of task I find reasonable to outsource to AI.

            This is with a high end but not specifically AI-focused desktop that I mostly built with VMs, code compilation tasks, and gaming in mind some three years ago.

          • By guerrilla 2026-02-08 4:08, 2 replies

            Yes, this is what I meant. People are running huge models at home now, I assumed people could do it on premises or in a data center if you're a business, presumably faster... but yeah it definitely depends on what time scales we're talking.

            • By copperx 2026-02-08 7:23

              I'd love to know what kind of hardware it would take to do inference at the speed provided by the frontier model providers (assuming their models were available for local use).

              10k worth of hardware? 50k? 100k?

              Assuming a single user.

            • By HumanOstrich 2026-02-08 7:22

              Huge models? First you have to spend $5k-$10k or more on hardware. Maybe $3k for something extremely slow (<1 tok/sec) that is disk-bound. So that's not a great deal over batch API pricing for a long, long time.

              Also you still wouldn't be able to run "huge" models at a decent quantization and token speed. Kimi K2.5 (1T params) with a very aggressive quantization level might run on one Mac Studio with 512GB RAM at a few tokens per second.

              To run Kimi K2.5 at an acceptable quantization and speed, you'd need to spend $15k+ on 2 Mac Studios with 512GB RAM and cluster them. Then you'll maybe get 10-15 tok/sec.

      • By gruez 2026-02-08 2:19, 1 reply

        Does that even work out to be cheaper, once you factor in how much extra power you'd need?

        • By HumanOstrich 2026-02-08 7:19

          How much extra power do you think you would need to run an LLM on a CPU (that will fit in RAM and be useful still)? I have a beefy CPU and if I ran it 24/7 for a month it would only cost about $30 in electricity.

  • By Nition 2026-02-07 19:27, 2 replies

    Note that you can't use this mode to get the most out of a subscription - they say it's always charged as extra usage:

    > Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.

    Although if you visit the Usage screen right now, there's a deal you can claim for $50 free extra usage this month.

    • By esperent 2026-02-08 3:36, 1 reply

      So it's basically useless then. Even with Claude Max I have to manage my usage when doing TDD, and using ccusage tool I've seen that I'd frequently hit $200 per day if I was on the API. At 6x cost you'll burn through $50 in about 20 minutes. I wish that was hyperbole.

      • By andersa 2026-02-08 6:34, 1 reply

        I tried casually using it for two hours and it burned $100 at the current 50% discounted rate, so your guess is pretty accurate...

        • By copperx 2026-02-08 7:26, 1 reply

          I still don't get why Claude is so expensive.

          • By airspresso 2026-02-08 10:35

            Because we all prefer it over Gemini and Codex. Anthropic knows that and needs to get as much out of it as possible while they can. Not saying the others will catch up soon. But at some point other models will be as capable as Opus and Sonnet are now, and then it's easier to let price guide the choice of provider.

HackerNews