Understanding the Go Runtime: The Scheduler

2026-03-09 14:54 · internals-for-interns.com


In the previous article we explored how Go’s memory allocator manages heap memory — grabbing large arenas from the OS, dividing them into spans and size classes, and using a three-level hierarchy (mcache, mcentral, mheap) to make most allocations lock-free. A key detail was that each P (processor) gets its own memory cache. But we never really explained what a P is, or how the runtime decides which goroutine runs on which thread. That’s the scheduler’s job, and that’s what we’re exploring today.

The scheduler is the piece of the runtime that answers a deceptively simple question: which goroutine runs next? You might have hundreds, thousands, or even millions of goroutines in your program, but you only have a handful of CPU cores. The scheduler’s job is to multiplex all those goroutines onto a small number of OS threads, keeping every core busy while making sure no goroutine gets starved.

If you’ve ever used goroutines and channels, you’ve already benefited from the scheduler without knowing it. Every go statement, every channel send and receive, every time.Sleep—they all interact with the scheduler. Let’s see how it works.

Let’s start with the fundamental building blocks — the three structures that the entire scheduler is built around.

The GMP Model

The scheduler is built around three concepts, commonly called the GMP model: G (goroutine), M (machine/OS thread), and P (processor). We touched on these during the bootstrap article, but now let’s look at each one properly.

G — Goroutine

A G is a goroutine — the Go runtime’s representation of a piece of concurrent work. Every time you write go f(), the runtime creates (or reuses) a G to track that function’s execution.

What does a G actually carry? The struct has a lot of fields, but the ones I think are most useful for understanding how it works are: a small stack (starting at just 2KB), some saved registers (stack pointer, program counter, etc.) so the scheduler can pause it and resume it later, a status field that tracks what the goroutine is doing (running, waiting, ready to run), and a pointer to the M currently running it. The full struct in src/runtime/runtime2.go has a lot more — fields for panic and defer handling, GC assist tracking, profiling labels, timers, and more.

Compare that to an OS thread, which typically starts with a 1–8MB stack and carries a lot of kernel state. A goroutine is dramatically lighter — that’s why you can have millions of them in a single program. An OS thread? You’ll start feeling the pressure at a few thousand.

So goroutines are the work. But someone has to actually execute that work — the CPU doesn’t know what a goroutine is. It only knows how to run threads.

M — Machine (OS Thread)

An M (defined in src/runtime/runtime2.go) is an OS thread — the thing that actually executes code. The scheduler’s job is to put goroutines onto Ms so they can run.

Every M has two goroutine pointers that are worth knowing about. The first is curg — the user goroutine currently running on this thread. That’s your code. The second is g0 — and every M has its own. g0 is a special goroutine that’s reserved for the runtime’s own housekeeping — scheduling decisions, stack management, garbage collection bookkeeping. It has a much larger stack than regular goroutines: typically 16KB, though it can be 32KB or 48KB depending on the OS and whether the race detector is enabled. Unlike regular goroutines, the g0 stack doesn’t grow — it’s fixed at allocation time, so it has to be big enough upfront to handle whatever the runtime needs to do. When the scheduler needs to make a decision (which goroutine to run next, how to handle a blocking operation), it switches from your goroutine to this M’s g0 to do that work. Think of g0 as the M’s “manager mode” — it runs the scheduling logic, then hands control back to a user goroutine.

An M also has a pointer to the P it’s currently attached to. This is important: without a P, an M can’t run Go code. It’s just an idle OS thread sitting there doing nothing. Why does an M need a P at all?

P — Processor

This is the clever part of the design. A P (defined in src/runtime/runtime2.go) is not a CPU core and it’s not a thread — it’s a scheduling context. Think of it as a workstation: it has everything a goroutine needs to run efficiently, and an M has to sit down at one before it can do any real work.

Why not just let Ms run goroutines directly? The problem is system calls. When an M enters the kernel, the entire OS thread blocks — and if all the scheduling resources were attached to the M, they’d be stuck too. The run queue, the memory cache, everything would be frozen until the syscall returns. By putting all of that on a separate P, the runtime can detach the P from a blocked M and hand it to a free one. The work keeps moving even when a thread is stuck.

So each P carries its own local run queue — a list of up to 256 goroutines that are ready to run. It also has a runnext slot, which is like a fast-pass for the very next goroutine to execute. There’s a gFree list where finished goroutines are kept around so they can be recycled instead of allocated from scratch. It even carries its own mcache — the per-P memory cache we saw in the memory allocator article. And because each P has its own copy of all this stuff, the threads using it don’t need to fight over shared locks all the time — that’s a nice bonus.

The number of Ps is controlled by GOMAXPROCS, which defaults to the number of CPU cores. So on an 8-core machine, you have 8 Ps, meaning at most 8 goroutines can truly run in parallel at any moment. But you can have far more Ms than Ps — some might be blocked in system calls while others are actively running goroutines. The key is that only GOMAXPROCS of them can be running Go code at any given time.

This decoupling is the heart of the scheduler’s design, and we’ll see why it matters so much as we go through the rest of the article.

So we have Gs, Ms, and Ps — but somebody needs to keep track of all of them. That’s the schedt struct.

The Scheduler State (schedt)

The schedt struct (defined in src/runtime/runtime2.go) is the global scheduler state. There’s exactly one instance of it — a global variable called sched — and it holds everything that doesn’t belong to any specific P or M. Think of it as the shared bulletin board that the Ps and Ms check when they need to coordinate.

What lives there? First, the global run queue (runq) — a linked list of goroutines that aren’t in any P’s local queue. These are goroutines that overflowed from a full local queue, or that came back from a system call and couldn’t find a P. There’s also a global free list (gFree) of dead goroutines waiting to be recycled — when a P’s local free list runs out, it refills from here, and when a P has too many dead goroutines, it dumps some back. The same two-level pattern we saw in the memory allocator: local caches for the fast path, shared pool as backup.

Then there are the idle lists. When a P has no M running it, it goes on the pidle list. When an M has no work and no P, it goes on the midle list and sleeps. The scheduler also tracks how many Ms are currently spinning (looking for work) in nmspinning — we’ll explain what spinning means later in the article — and whether the GC is requesting a stop-the-world pause in gcwaiting. All of this shared state is protected by sched.lock — but the lock is designed to be held very briefly, because the hot path (picking a goroutine from a local queue) doesn’t touch schedt at all.

Beyond schedt, the runtime keeps master lists of every G, M, and P that has ever been created — the global variables allgs, allm, and allp. These aren’t used for scheduling decisions. They exist so the runtime can find everything when it needs to do something global, like scanning all goroutine stacks during garbage collection or checking for stuck system calls in sysmon.

Here’s the full picture:

Go Scheduler Diagram

Now that we’ve set the stage, it’s time to see the actors in action. Let’s follow a goroutine through its lifetime and see how it moves across this battlefield.

The Life of a Goroutine

Let’s follow the life of a goroutine from birth to death — and sometimes back again. The states are defined in src/runtime/runtime2.go, but rather than listing them, let’s walk through the story.

Birth: Creation and First Steps

It starts when you write go f(). The compiler turns this into a call to newproc() (in src/runtime/proc.go), and the runtime needs a G struct to represent this new goroutine. But it doesn’t necessarily allocate one from scratch — first, it checks the current P’s local free list of dead goroutines. If there’s one available, it gets recycled, stack and all. If the local list is empty, it tries to grab a batch from the global free list in schedt. Only if both are empty does the runtime allocate a new G with a fresh 2KB stack. This reuse is why goroutine creation is so cheap — most of the time, it’s just pulling a G off a list and reinitializing a few fields.

If the G was recycled from the free list, it’s already in _Gdead state — that’s where goroutines go when they finish. If it was freshly allocated, it starts in _Gidle (a blank struct, never used before) and immediately transitions to _Gdead. Either way, the G is in _Gdead before setup begins. Wait — dead already? Yes, but only technically. _Gdead means “not in use by the scheduler” — it’s the state for goroutines that are either being set up or finished and waiting for reuse. The runtime uses it as a safe “parked” state while it configures the G’s internals.

During initialization, the runtime prepares the goroutine so it’s ready to run. It sets the stack pointer to the top of its stack, points the program counter at your function so it knows where to start executing, and places a return address pointing to goexit — the goroutine cleanup handler. This way, when your function finishes and returns, execution naturally lands in goexit without needing any special “is it done?” check.

Once setup is complete, the G moves to _Grunnable and goes into the current P’s runnext slot, displacing whatever was there before. This means the new goroutine will run very soon — right after the current goroutine yields.

Now the goroutine is alive — sitting on a run queue, ready to execute, just waiting for an M to pick it up.

Running

When the scheduler picks this G off the queue, it transitions to _Grunning. This is the active state — the goroutine is executing your code on an M, with a P. This is where it spends its productive time.

But goroutines rarely run straight through to completion. At some point, something will interrupt the flow, and what happens next depends on why the goroutine stopped. This is where the story branches.

Blocking and Unblocking

Maybe the goroutine tries to receive from an empty channel, or acquire a locked mutex, or sleep. Here’s a detail that might surprise you: there’s no external “scheduler thread” that swoops in and parks the goroutine. The goroutine parks itself.

Let’s say your goroutine does <-ch on an empty channel. The channel implementation sees there’s nothing to receive, so it calls gopark() to park the goroutine until a value arrives. The goroutine switches to the g0 stack, changes its own status to _Gwaiting, and adds itself to the channel’s wait queue. After that, it’s gone from the scheduler’s perspective — not on any run queue, just sitting on the channel’s internal wait list. The M doesn’t go to sleep though. It calls schedule() and picks up the next goroutine. From the M’s point of view, one goroutine parked and another one started running — the M stayed busy the whole time.

gopark() also records why the goroutine is blocking — channel receive, mutex lock, sleep, select, and so on. This is what shows up when you look at goroutine dumps or profiling data, so you can tell exactly what each goroutine is waiting for.

Now for the other side: what happens when the thing the goroutine was waiting for finally happens? Say another goroutine sends a value on that channel. The sender finds our goroutine on the channel’s wait queue, copies the value directly to it, and calls goready(). This changes the goroutine’s status back to _Grunnable and places it in the sender’s runnext slot — meaning it’ll run very soon, right after the sender yields. This runnext placement creates a tight back-and-forth between producer and consumer goroutines. G1 sends, G2 receives and runs immediately, G2 sends back, G1 receives and runs immediately — almost like coroutines handing off to each other, with minimal scheduling overhead.

System Calls

Blocking on channels and mutexes is one thing — the goroutine parks, but the M and P stay free. System calls are a different beast, because they block the entire OS thread.

When a goroutine makes a system call — reading a file, accepting a network connection, anything that enters the kernel — the entire OS thread blocks. Before entering the kernel, the goroutine calls entersyscall(), which saves its context and changes its status to _Gsyscall. But here’s an important detail: the M doesn’t give up its P. It keeps it. Why? Because most system calls are fast — a few microseconds — and the goroutine will come back and keep running on the same P as if nothing happened. No locks, no coordination, no overhead.

But as soon as the goroutine is in _Gsyscall, it’s in danger of losing its P. If the system call takes too long, sysmon can come along and retake the P — detach it from the blocked M and hand it to another thread so the goroutines in its run queue keep running. This is where the G-M-P decoupling really pays off: the thread is stuck in the kernel, but the work moves on.

When the system call finishes, the goroutine checks whether it still has its P. If it does — great, keep going. If sysmon took it, the goroutine tries to grab any idle P. And if there are no idle Ps at all, it puts itself on the global run queue and waits to be picked up. We’ll cover sysmon in more detail in an upcoming article.

So far we’ve seen goroutines block voluntarily — on channels, mutexes, and system calls. But there’s something more subtle happening behind the scenes every time a goroutine calls a function.

Stack Growth

There’s another thing that can happen while a goroutine is running: it can run out of stack space. Go goroutines start with a tiny 2KB stack, and unlike OS threads, they don’t get a fixed-size stack upfront. Instead, the compiler inserts a small check called the stack growth prologue at the beginning of most functions. This check compares the current stack pointer against the stack limit — if there’s not enough room for the next function call, the runtime steps in.

When that happens, the runtime allocates a new, larger stack (typically double the size), copies the old stack contents over, adjusts all the pointers that reference stack addresses, and frees the old stack. The goroutine then continues running on its new, bigger stack as if nothing happened. This is what allows Go to run millions of goroutines — they start small and only grow when they actually need the space.

This stack check is worth mentioning here because, as we’ll see in the next section, the scheduler piggybacks on it for cooperative preemption.

Preemption

The goroutine might also be stopped involuntarily. Everything we’ve seen so far — blocking on channels, making system calls, finishing — involves the goroutine cooperating. But what if a goroutine never yields? A tight computational loop without any function calls, channel operations, or memory allocations would never give the scheduler a chance to run anything else on that P.

Go has two answers. The first is cooperative preemption: the compiler inserts a small check at the beginning of most functions that tests whether the goroutine has been asked to yield. When the runtime wants to preempt a goroutine, it flips a flag, and the next function call triggers the check and hands control back to the scheduler. This is cheap — it reuses the stack growth check that’s already there — but it only works at function calls.

The second is asynchronous preemption: for goroutines stuck in tight loops with no function calls, the runtime sends an OS signal (SIGURG on Unix) directly to the thread. The signal handler interrupts the goroutine, saves its context, and yields to the scheduler. This is the heavy hammer — it works even when cooperative preemption can’t.

In both cases, the preempted goroutine transitions directly to _Grunnable and goes back on a run queue — it’ll get another chance to run soon. There’s also a special _Gpreempted state, but that’s reserved for when the GC or debugger needs to fully suspend a goroutine via suspendG. In either case, it’s sysmon that detects goroutines running too long (more than 10ms) and triggers the preemption. We’ll explore the details in the system monitor article.

Death and Recycling

Finally, the goroutine’s function returns. Remember that the return address was set up to point at goexit during creation? So the return lands in goexit, which hands off to goexit0(), and the goroutine handles its own death. It changes its own status to _Gdead, cleans up its fields, drops the M association, and puts itself on the P’s free list. Then it calls schedule() to find the next goroutine for this M.

The G isn’t freed or garbage collected. It sits on the free list, stack and all, waiting to be recycled. This is a key optimization — allocating and setting up a new G is much more expensive than reinitializing a dead one. And this is where the story comes full circle: a new go statement might pull this same G off the free list, reinitialize it, and send it through the whole journey again.

The Self-Service Pattern

There’s a pattern running through all of these stages: the goroutine is always the one doing the work of its own state transitions. There’s no central scheduler thread pulling the strings — the goroutine parks itself, adds itself to wait queues, cleans itself up, and invokes the scheduler to pick the next G. The scheduler is really just a set of functions that goroutines call on themselves, using the M’s g0 stack to do the bookkeeping.

Most goroutines spend their lives bouncing between _Grunnable, _Grunning, and _Gwaiting — ready, running, waiting, ready, running, waiting — until they finally finish and return to _Gdead.

With the data structures and states in place, let’s look at the core algorithm — the loop that drives everything.

The Scheduling Loop

Now for the heart of the scheduler: the schedule() function (in src/runtime/proc.go). This is a loop that runs on every M, on the g0 stack, and its job is simple: find a runnable goroutine and execute it. When the goroutine stops running (it blocks, finishes, or gets preempted), control returns to schedule(), and the loop starts again.

Here’s the rough shape:

Go Scheduler Loop

The goroutine runs until it yields control back to the scheduler—either voluntarily (by blocking on a channel, calling runtime.Gosched(), etc.) or involuntarily (via preemption). Then we’re back at schedule(), looking for the next goroutine.

The schedule() function itself is straightforward. It checks a few special cases (is this M locked to a specific goroutine?), and then calls findRunnable() to get the next goroutine. Once it has one, it calls execute() to run it.

The interesting part is findRunnable()—that’s where all the decisions happen. Let’s break down exactly how it searches for work.

Finding Work: The Search Order

findRunnable() (in src/runtime/proc.go) is the function that answers “what should I run next?” It searches multiple sources in a specific order, and it keeps looking until it finds something — if there’s truly nothing to do, it parks the M to sleep until work appears, and then resumes the search.

Here’s the search order:

1. GC and Trace Work

Before looking for user goroutines, the scheduler checks if there’s runtime work to do. If the GC is active and needs a mark worker, that takes priority. If execution tracing is enabled and its reader goroutine is ready, that also takes priority. The runtime’s own needs come first.

2. The Global Queue Fairness Check

Every 61st schedule call, the scheduler grabs a single goroutine from the global run queue before looking at the local queue. Why 61? It’s a prime number, which helps avoid synchronization patterns where the check always lines up with the same goroutine. The point is to prevent starvation: if goroutines are constantly being added to local queues, the ones sitting in the global queue could wait forever without this check.

3. The Local Run Queue

This is the fast path, and where most goroutines come from. The scheduler first checks the runnext slot—a priority position that holds the single goroutine most likely to run next. If runnext is set, the goroutine gets it and inherits the current time slice, meaning it doesn’t reset the scheduling tick. This is an optimization for producer-consumer patterns: if G1 sends on a channel and wakes G2, G2 goes into runnext and runs immediately, almost like a direct handoff.

If runnext is empty, the scheduler takes from the ring buffer—a lock-free circular queue of up to 256 goroutines. Only the owning M writes to this queue (single producer), so no locks are needed for the common case.

4. The Global Run Queue (Again)

If the local queue is empty, check the global queue. This time, instead of grabbing just one goroutine, the scheduler grabs a batch. This amortizes the cost of acquiring the global lock (sched.lock). One lock acquisition, many goroutines.

5. Network Polling

Before resorting to stealing, the scheduler checks the netpoller to see if any network I/O is ready. If any goroutines were blocked waiting for network operations and those operations are now complete, those goroutines become runnable. We’ll talk about how the netpoller works in a future article.

6. Work Stealing

If all the above came up empty, it’s time to steal. The scheduler looks at other Ps’ local queues and takes half of their goroutines. This is the mechanism that keeps all cores busy even when work is unevenly distributed.

7. Last Resort: Park

If there’s truly nothing to do anywhere—no local work, no global work, no network I/O, nothing to steal—the M releases its P, puts it on the idle P list, and parks itself to sleep. It will be woken up later when new work appears.

But that “parking” decision isn’t as straightforward as it sounds. Should a thread go to sleep the moment it runs out of work, or should it hang around for a bit in case something shows up?

Spinning Threads

There’s a subtle balance to strike here. When a thread runs out of work — its local queue is empty, there’s nothing to steal — should it go to sleep immediately? If it does, and new work arrives a microsecond later, there’s nobody awake to pick it up. Another thread has to be woken from sleep, which costs time. On the other hand, if too many idle threads stay awake burning CPU cycles looking for work that isn’t there, that’s pure waste.

Go’s answer is spinning threads. When an M runs out of work, it doesn’t park right away. Instead, it enters a spinning state — actively checking queues and trying to steal — for a brief period before giving up and going to sleep. The runtime limits the number of spinners to at most half the number of busy Ps — so on an 8-core machine with 6 busy Ps, up to 3 threads can spin at once. Enough to be responsive, not so many that they waste CPU.

The other side of the coin is when new work appears — say a new goroutine is created or a channel unblocks. The runtime is even more conservative here: it only wakes up a sleeping thread if there are zero spinners. If there’s already a spinning thread out there, it’ll pick up the new work. The goal is simple: always have someone ready to grab new work, but not too many someones.

All of these mechanisms — blocking, unblocking, system calls, preemption — involve switching from one goroutine to another. Let’s look at what that switch actually costs.

Context Switching

Let’s talk briefly about what happens during a goroutine context switch, because it’s what makes the whole system fast.

When the scheduler switches from one goroutine to another, it needs to save where the current goroutine was and restore where the next one left off. The good news is that a goroutine’s state is surprisingly small. The mcall() assembly function only saves 3 values — the stack pointer, the program counter, and the base pointer — into a tiny gobuf struct. That’s it. Why so few? Because goroutine switches happen at function call boundaries, and at those points the compiler has already spilled any important registers to the stack following normal calling conventions. The switch only needs to save enough to find the stack again.

gogo() does the opposite: it restores those saved values and jumps right into the goroutine. Together, mcall() and gogo() are the mechanism behind every voluntary goroutine switch. For async preemption (where the goroutine is interrupted mid-execution by a signal), the full register set has to be saved — but that’s the exception, not the common path.

And it’s fast. A goroutine context switch takes roughly 50–100 nanoseconds — about 200 CPU cycles. Compare that to an OS thread context switch, which involves saving the full register set and switching kernel stacks — that costs 1–2 microseconds, 10 to 40 times slower. This is a big part of why goroutines scale so much better than threads.

Let’s wrap up what we’ve learned.

The Go scheduler multiplexes goroutines onto OS threads using the GMP model: Gs (goroutines) are the work, Ms (OS threads) provide the execution, and Ps (processors) carry the scheduling context — local run queues, memory caches, and everything needed to run goroutines efficiently. The global schedt struct ties it all together with shared state like the global run queue, idle lists, and the spinning thread count.

We followed a goroutine through its whole life — from creation (recycling dead Gs when possible), through running, blocking (where the goroutine parks itself), system calls (where the P detaches so other goroutines keep running), stack growth, and preemption (both cooperative and asynchronous). At the end, the goroutine cleans up after itself and goes back on the free list for reuse.

The scheduling loop in schedule() and findRunnable() drives it all — checking the local queue, the global queue for fairness every 61 ticks, the netpoller, and stealing from other Ps before giving up. Spinning threads keep the system responsive by staying awake briefly to catch new work, and context switching between goroutines costs only about 50–100 nanoseconds thanks to the small amount of state involved.

If you want to explore the implementation yourself, the main scheduler code lives in src/runtime/proc.go, with data structures in src/runtime/runtime2.go and assembly routines in src/runtime/asm_*.s.

In the next article, we’ll look at the garbage collector — how it tracks which objects are still alive and reclaims the rest, all while your program keeps running.



Comments

  • By pss314 2026-03-13 0:49 (2 replies)

    I enjoyed both these GopherCon talks:

    GopherCon 2018: The Scheduler Saga - Kavya Joshi https://www.youtube.com/watch?v=YHRO5WQGh0k

    GopherCon 2017: Understanding Channels - Kavya Joshi https://www.youtube.com/watch?v=KBZlN0izeiY

    • By c0balt 2026-03-13 1:12 (1 reply)

      https://m.youtube.com/watch?v=-K11rY57K7k - Dmitry Vyukov — Go scheduler: Implementing language with lightweight concurrency

      This one notably also explains the design considerations behind Go's M:N:P model in comparison to other schemes, and which specific challenges it tries to address.

      • By konart 2026-03-13 13:09

        And by Dmitry himself.

    • By jvillegasd 2026-03-13 1:05

      Good videos, thanks for sharing!

  • By withinboredom 2026-03-13 7:55 (7 replies)

    My biggest issue with Go is its incredibly unfair scheduler. No matter what load you have, P99 and especially P99.9 latency will be higher than in any other language. The way it steals work guarantees that requests “in the middle” will be served last.

    It’s a problem that only Go can solve, but fixing it means giving up some speed: requests that are currently handled immediately, but shouldn’t be, would have to wait their turn. So overall latency would go up even as P99 dropped precipitously. Thus, they’ll probably never fix it.

    If you have a system that requires predictable latency, go is not the right language for it.

    • By mknyszek 2026-03-13 14:33

      > Thus, they’ll probably never fix it.

      I'm sorry you had a bad experience with Go. What makes you say this? Have you filed an issue upstream yet? If not, I encourage you to do so. I can't promise it'll be fixed or delved into immediately, but filing detailed feedback like this is really helpful for prioritizing work.

    • By melodyogonna 2026-03-13 9:15 (3 replies)

      > If you have a system that requires predictable latency, go is not the right language for it.

      Having a garbage collector already makes this the case; it is a known trade-off.

    • By pjmlp 2026-03-13 8:33 (1 reply)

      It misses having a custom scheduler option, like the Java and .NET runtimes offer; unfortunately that is too many knobs for the usual Go approach to language design.

      An interface for how it is supposed to behave, a runtime.SetScheduler() or something, would be enough, but it won't happen.

      • By MisterTea 2026-03-13 18:04

        I find it hard to believe the people who built Go, coming from designing Plan 9 and Inferno, would build a language where it is difficult to swap out a component.

        I have this feeling that in their quest to make Go simple, they added complexity in other areas. Then again, this was built at Google, not Bell Labs so the culture of building absurdly complex things likely influenced this.

    • By red_admiral 2026-03-13 7:58 (1 reply)

      > If you have a system that requires predictable latency, go is not the right language for it.

      I presume that's by design, to trade off against other things google designed it for?

      • By withinboredom 2026-03-13 8:00

        No clue. All I know is that people complain about it every time they benchmark.

    • By kjksf 2026-03-13 11:34 (2 replies)

      > No matter what load you have, P99 and especially P99.9 latency will be higher than any other language

      I strongly call BS on that.

      A strong claim, and the evidence seems to be a hallucination in your own head.

      There are several writeups of large backends ported from node/python/ruby to Go which resulted in dramatic speedups, including drop in P99 and P99.9 latencies by 10x

      That's empirical evidence your claim is BS.

      What exactly is so unfair about Go scheduler and what do you compare it to?

      Node's lack of multi-threading?

      Python's and Ruby's GIL?

      Just leaving this to OS thread scheduler which, unlike Go, has no idea about i/o and therefore cannot optimize for it?

      Apparently the source of your claim is https://github.com/php/frankenphp/pull/2016

      Which is optimizing for a very specific micro-benchmark of hammering std-lib http server with concurrent request. Which is not what 99% of go servers need to handle. And is exercising way more than a scheduler. And is not benchmarking against any other language, so the sweeping statement about "higher than any other language" is literally baseless.

      And you were able to make a change that trades throughput for P99 latency without changing the scheduler, which kind of shows it wasn't the scheduler but an interaction between a specific implementation of HTTP server and Go scheduler.

      And there are other HTTP servers in Go that focus on speed. It's just 99.9% of Go servers don't need any of that because the baseline is 10x faster than python/ruby/javascript and on-par with Java or C#.

      • By jerf 2026-03-13 15:31

        "There are several writeups of large backends ported from node/python/ruby to Go which resulted in dramatic speedups, including drop in P99 and P99.9 latencies by 10x"

        But that's not comparing apples to apples. When you get a dramatic speedup, you will also see big drops in the P99 and P99.9 latencies because what stressed out the scripting language is a yawn to a compiled language. Just going from stressed->yawning will do wonders for all your latencies, tail latencies included.

        That doesn't say anything about what will happen when the load increases enough to start stressing the compiled language.

      • By withinboredom 2026-03-13 14:09 | 1 reply

        Do I need to share the TLA+ spec that shows its unfair? Or do you have any actual proof to your claims?

        • By 9rx 2026-03-13 15:23 | 1 reply

          It would be helpful for you to share a link to the Github issue you created. If the TLA+ spec you no doubt put a lot of time into creating is contained there, that would be additionally amazing, but more relevant will be the responses from the maintainers so that we're not stuck with one side of the story.

          Of course, expecting you to provide the link would be incredibly onerous. We can look it up ourselves just as easily as you can. Well, in theory we can. The only trouble is that I cannot find the issue you are talking about. I cannot find any issues in the Go issue tracker from your account.

          So, in the interest of good faith, perhaps you can help us out this one time and point us in the right direction?

          • By withinboredom 2026-03-13 16:49 | 1 reply

            I’m not interested in contributing to go. I tried once, was basically ignored. I have contributed to issues there where it has impacted projects I’ve worked on. But even then, it didn’t feel collaborative; mostly felt like dealing with a tech support team instead of other developers.

            That being said, I love studying go and learning how to use it to the best of my ability because I work on sub-µs networking in go.

            When I get home, I’ll dig it up. But if you think it’s a fair scheduler, I invite you to just think about it on a whiteboard for a few minutes. It’s nowhere near fair, and that should be self-evident from first principles alone.

            • By withinboredom 2026-03-13 17:08 | 1 reply

              Here’s a much better write up than I’m willing to do: https://www.cockroachlabs.com/blog/rubbing-control-theory/

              There are also multiple issues about this on GitHub.

              And an open issue that is basically been ignored. golang/go#51071

              Like I said. Go won’t fix this because they’ve optimized for throughput at the expense of everything else, which means higher tail latencies. They’d have to give up throughput for lower latency.
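              For anyone who wants to measure scheduling tail latency directly: since Go 1.17 the runtime exposes how long goroutines sat runnable before being scheduled as a histogram, via the runtime/metrics package. A minimal sketch (the metric name is real; the p99 extraction loop is my own, not part of the API):

              ```go
              package main

              import (
                  "fmt"
                  "runtime/metrics"
              )

              func main() {
                  // Generate some scheduling activity first.
                  done := make(chan struct{})
                  for i := 0; i < 100; i++ {
                      go func() { done <- struct{}{} }()
                  }
                  for i := 0; i < 100; i++ {
                      <-done
                  }

                  // /sched/latencies:seconds records how long goroutines spent
                  // in the runnable state before actually running (Go 1.17+).
                  sample := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
                  metrics.Read(sample)
                  h := sample[0].Value.Float64Histogram()

                  var total, cum uint64
                  for _, c := range h.Counts {
                      total += c
                  }
                  for i, c := range h.Counts {
                      cum += c
                      if float64(cum) >= 0.99*float64(total) {
                          // Buckets[i+1] is the upper bound of bucket i.
                          fmt.Printf("approx p99 scheduling latency: <= %gs\n", h.Buckets[i+1])
                          break
                      }
                  }
              }
              ```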

              • By 9rx 2026-03-13 17:20 | 1 reply

                > And an open issue that is basically been ignored. golang/go#51071

                It doesn't look ignored to me. It explains that the test coverage is currently poor, so they are in a terrible position of not being able to make changes until that is rectified.

                The first step is to improve the test coverage. Are you volunteering? AI isn't at a point where it is going to magically do it on its own, so it is going to take a willing human hand. You do certainly appear to be the perfect candidate, both having the technical understanding and the need for it.

                • By withinboredom 2026-03-13 18:27 | 1 reply

                  Heh. I've had my fair share of mailing list drama. This is political AND technical. Someone saying "let’s cut throughput" is going to get shot down fast, no matter the technical merit. If someone with the political clout were to be willing to champion the work and guide the discussion appropriately while someone like me does the work, that's different. That's at least how things like this are done in other communities, unless go is different.

                  • By 9rx 2026-03-13 19:25 | 1 reply

                    > If someone with the political clout were to be willing to champion the work and guide the discussion appropriately while someone like me does the work, that's different.

                    There is unlikely to be anyone on the Go team with more political clout in this particular area than the one who has already reached out to you. You obviously didn't respond to him publicly, but did he reject your offer in private? Or are you just imagining some kind of hypothetical scenario where they are refusing to talk to you, despite evidence to the contrary?

                    • By withinboredom 2026-03-13 22:02 | 1 reply

                      > You obviously didn't respond to him publicly, but did he reject your offer in private?

                      I literally have no idea what you're talking about here.

                      • By 9rx 2026-03-13 22:34 | 1 reply

                        You must not have read all the comments yet? One of Go's key runtime maintainers sent you a message. Now is your opportunity to give him your plan so that he can give you the political support you seek.

                        • By withinboredom 2026-03-13 22:37 | 1 reply

                          I still have no idea what you are talking about.

                          • By 9rx 2026-03-14 0:47

                            I thought it was a simple question. You don't know if you have read the comments or not?

    • By pothamk 2026-03-13 11:37

      [dead]

    • By desdenova 2026-03-13 9:47

      > If you have a system, go is not the right language for it.

      FTFY

  • By Someone 2026-03-13 12:57 | 1 reply

    > a goroutine’s state is surprisingly small. The mcall() assembly function only saves 3 values — the stack pointer, the program counter, and the base pointer — into a tiny gobuf struct. That’s it. Why so few? Because goroutine switches happen at function call boundaries, and at those points the compiler has already spilled any important registers to the stack following normal calling conventions.

    Wouldn’t that mean go never uses registers to pass arguments to functions?

    If so, that seems in conflict with https://go.dev/src/cmd/compile/abi-internal#function-call-ar..., which says “Because access to registers is generally faster than access to the stack, arguments and results are preferentially passed in registers”

    Or does the compiler always use Go’s stable ABI, known as ABI0, in functions where it inserts code to potentially context switch, and only use the (potentially) faster ABI that passes arguments in registers elsewhere?

    • By mknyszek 2026-03-13 14:21

      The compiler generates code to spill arguments to the stack at synchronous preemption points (function entry). Signal-based preemption has a spill path that saves the full ABI register set.
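      A quick way to see the signal-based path at work (a sketch; assumes Go 1.14 or later, where asynchronous preemption was introduced — on older versions this program would hang):

      ```go
      package main

      import (
          "fmt"
          "runtime"
          "time"
      )

      func main() {
          // Force a single P so the spinning goroutine would monopolize
          // the scheduler if it could not be preempted.
          runtime.GOMAXPROCS(1)
          go func() {
              // A tight loop with no function calls has no synchronous
              // preemption points. Since Go 1.14, signal-based preemption
              // interrupts it anyway, saving the full register set.
              for {
              }
          }()
          time.Sleep(100 * time.Millisecond)
          fmt.Println("main still scheduled despite the tight loop")
      }
      ```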

HackerNews