Unlocking Python's Cores: Energy Implications of Removing the GIL

2026-03-06 · arxiv.org

Abstract: Python's Global Interpreter Lock prevents execution on more than one CPU core at the same time, even when multiple threads are used. However, starting with Python 3.13 an experimental build allows disabling the GIL. While prior work has examined speedup implications of this disabling, the effects on energy consumption and hardware utilization have received less attention. This study measures execution time, CPU utilization, memory usage, and energy consumption using four workload categories: NumPy-based, sequential kernels, threaded numerical workloads, and threaded object workloads, comparing GIL and free-threaded builds of Python 3.14.2. The results highlight a trade-off. For parallelizable workloads operating on independent data, the free-threaded build reduces execution time by up to 4 times, with a proportional reduction in energy consumption, and effective multi-core utilization, at the cost of an increase in memory usage. In contrast, sequential workloads do not benefit from removing the GIL and instead show a 13-43% increase in energy consumption. Similarly, workloads where threads frequently access and modify the same objects show reduced improvements or even degradation due to lock contention. Across all workloads, energy consumption is proportional to execution time, indicating that disabling the GIL does not significantly affect power consumption, even when CPU utilization increases. When it comes to memory, the no-GIL build shows a general increase, more visible in virtual memory than in physical memory. This increase is primarily attributed to per-object locking, additional thread-safety mechanisms in the runtime, and the adoption of a new memory allocator.

These findings suggest that Python's no-GIL build is not a universal improvement. Developers should evaluate whether their workload can effectively benefit from parallel execution before adoption.
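The GIL-versus-free-threaded contrast above can be sketched with a small benchmark. This is an illustrative sketch, not the paper's harness: `sys._is_gil_enabled()` is the CPython 3.13+ API for detecting whether the GIL is active (guarded here for older versions), and the pure-Python loop stands in for the paper's "threaded numerical workload" category. On a free-threaded build the four threads can run on separate cores; on a GIL build they serialize.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor

def burn(n):
    # CPU-bound pure-Python work on independent data (no shared objects).
    total = 0
    for i in range(n):
        total += i * i
    return total

# On builds older than 3.13 the probe is absent, so assume the GIL is on.
gil = getattr(sys, "_is_gil_enabled", lambda: True)()

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(burn, [200_000] * 4))
elapsed = time.perf_counter() - start
print(f"GIL enabled: {gil}, elapsed: {elapsed:.3f}s")
```

Running the same script under both builds and comparing `elapsed` (and wall-power, if measurable) reproduces the trade-off the paper describes for this workload class.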

From: Jose Montoya [view email]
[v1] Thu, 5 Mar 2026 04:01:30 UTC (166 KB)



Comments

  • By devrimozcay 2026-03-09 10:36, 7 replies

    One thing I'm curious about here is the operational impact.

    In production systems we often see Python services scaling horizontally because of the GIL limitations. If true parallelism becomes common, it might actually reduce the number of containers/services needed for some workloads.

    But that also changes failure patterns — concurrency bugs, race conditions, and deadlocks might become more common in systems that were previously "protected" by the GIL.

    It will be interesting to see whether observability and incident tooling evolves alongside this shift.

    • By kevincox 2026-03-09 15:02, 1 reply

      This is surely why Facebook was interested in funding this work. It is common to have N workers or containers of Python because you are generally restricted to one CPU core per Python process (you can get a bit higher if you use libraries that release the GIL for significant work). So the only scaling option is horizontal, because vertical scaling is very limited. The main downside of this was memory usage: you would have to load all of your code and libraries N times, and in-process caches would become less effective. So by being able to vertically scale a Python process much further, you can run fewer processes and save a lot of memory.

      Generally speaking, the optimal amount of horizontal scaling is as little as you can get away with. You may want a bit of horizontal scaling for redundancy and geo-distribution, but past that, vertically scaling to fewer, larger processes tends to be more efficient, easier to load balance, and has a handful of other benefits.

      • By philsnow 2026-03-09 17:25, 2 replies

        > The main downside of this was memory usage. You would have to load all of your code and libraries N times and in-process caches would become less effective.

        You can load modules and then fork child processes. Children will share memory with each other (if they need to modify any shared memory, they get copy-on-write pages allocated by the kernel) and you'll save quite a lot on memory.

        • By kevincox 2026-03-09 17:29

          Yes, this can help a lot, but it definitely isn't perfect. Especially since CPython uses reference counting, it is likely that many pages get modified relatively quickly as they are accessed. Many other GC strategies are also pretty hostile to CoW memory (for example mark bits, moving collectors, ...). Additionally, this doesn't help for lazily loaded data and caches in code and libraries.

        • By cma 2026-03-09 23:24

          Every Python object will trigger copy-on-write of a full memory page on any read, due to reference counting, though some objects will share pages.
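The mechanism is easy to observe on a standard (GIL) CPython build: merely creating another reference to an object increments the reference count stored in the object's header, which is a memory write from the kernel's point of view, dirtying the page even though the program only "read" the object.

```python
import sys

obj = []
before = sys.getrefcount(obj)  # getrefcount's own argument adds one temporary ref
ref = obj                      # "read-only" aliasing still writes the object header:
after = sys.getrefcount(obj)   # the refcount field was incremented in place
```

(Free-threaded builds use biased/deferred reference counting, which changes the exact numbers but not the underlying point.)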

    • By LtWorf 2026-03-09 15:59, 2 replies

      But Python can fork itself and run multiple processes in a single container. Why would there be a need to run several containers to run several processes?

      There's even the multiprocessing module in the stdlib to achieve this.

      • By heavyset_go 2026-03-09 17:21, 3 replies

        Threads are cheap: you can do N units of work simultaneously with N threads in one process, without serialization, IPC, or process-creation overhead.

        With multiprocessing, processes are expensive and each one hogs resources. You must serialize data on one side and deserialize it on the other for every IPC exchange, which is expensive and time-consuming.

        You shouldn't have to break out multiple processes, for example, to do some simple pure-Python math in parallel. It doesn't make sense to use multiple processes for something like that because the actual work you want to do will be overwhelmed by the IPC overhead.

        There are also limitations: only some data can be sent to and from multiple processes, because not all of your objects can be serialized for IPC.
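A minimal sketch of that limitation: `pickle`, the serializer `multiprocessing` uses for IPC, round-trips module-level functions (it stores them by reference), but lambdas and other local objects fail outright.

```python
import pickle

def square(x):
    return x * x

# Top-level functions pickle by reference, so they survive the round trip.
restored = pickle.loads(pickle.dumps(square))

# Lambdas (and closures, open files, locks, ...) cannot be pickled for IPC.
try:
    pickle.dumps(lambda x: x * x)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_picklable = False
```

With threads, none of this applies: objects are shared directly, so the whole serialization question disappears.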

        • By akdev1l 2026-03-09 17:42

          I think you have a good point on IPC, but process creation on Linux is almost as fast as thread creation.

          Unless the app is constantly creating and killing processes, the process-creation overhead is not that much, but IPC is a killer.

          And also your types aren't picklable or whatever, and now you gotta change a lot of stuff to get it to work lol.

        • By connorboyle 2026-03-09 20:58, 1 reply

          It makes sense to me that a program currently written using multiple processes would now be re-written to use multiple truly parallel threads. But it seems very odd to suggest (as your grandparent comment does) that a program currently run in multiple containers would likely be migrated to run on multiple threads.

          In other words, I imagine anyone who cares about the overhead from serialization, IPC, or process creation would already be avoiding (as much as possible) using containers to scale in the first place.

          • By heavyset_go 2026-03-10 1:08

            Yeah, I somehow glossed over the whole container thing.

            The container thing might be a horizontal-scaling setup where 1 container runs on 1 instance with 1 vCPU. Running multiple processes per instance means you need beefier slices of compute to take advantage of the parallelism, and you can't cleanly scale up and then down using only the resources you need.

            If you have a queue distributing work, that model makes sense with single-threaded interpreters, where consumer instances are spun up and down as needed, versus pushing work to a thread pool (or multiple instances with their own thread pools) that isn't inhibited by the GIL. The latter could be more efficient depending on the work.

        • By LtWorf 2026-03-10 6:08

          But… in Python threads don't run in parallel, which is the whole problem we are working around here.

      • By kccqzy 2026-03-09 16:41, 4 replies

        Forking and multithreading do not coexist. Even if one of your transitive dependencies decides to launch a thread that's 99% idle, it becomes unsafe to fork.

        • By rpcope1 2026-03-09 17:28

          I'm curious as to the downvotes on this. It's absolutely true. When I was maintaining a job-runner daemon that ran hundreds of thousands of who-knows-what Python tasks/jobs a day on shared infra with arbitrary code for a certain megacorp from 2016-2020 or so, this was one of the most insidious and ugly failure modes to debug and handle. The docs really make it sound like you can mix threading and multiprocessing, but you can never completely ensure that threading followed by a bare fork will be safe, period. It's really irritating that the docs would have you believe this is OK or safe, but it is in keeping with the Python philosophy of hiding the edge of the blade you're using until it's too late and you've cut the shit out of yourself.

        • By LtWorf 2026-03-09 20:33

          I'm replying to a person that scales Python by running several containers instead of 1 container with several Python processes.

        • By akdev1l 2026-03-09 17:44, 1 reply

          Why is it unsafe?

          • By LtWorf 2026-03-09 20:30, 1 reply

            In general only the thread calling fork() gets forked, so unless you call exec() soon after, there are a lot of complications with signals, locks, and shared memory.

            • By fc417fc802 2026-03-09 21:02, 2 replies

              What are the complications? A single thread with its own process sandbox with everything from the parent is exactly what I'd expect coming from C land. Are the complications you refer to specific to the python VM or more general?

              • By grogers 2026-03-09 22:57, 1 reply

                Even treating the process as read-only after forking is potentially fraught. What if a background thread is mutating some data structure? At the moment of the fork, the data structure might be internally inconsistent because the work to finish the mutation might not be completed. Imagine there are locks held by various threads at fork time (those threads no longer exist in the child): trying to take those locks in the child might deadlock, or even worse. There are tons of these types of gotchas.

                • By fc417fc802 2026-03-09 23:31

                  Okay so just all the usual threading gotchas. Nothing specific to Python.

                  Conceptually fork "just" noncooperatively preempts and kills all other threads. Use accordingly. Yes it's a giant footgun but then so is all low level "unmanaged" concurrency.

              • By kccqzy 2026-03-09 22:26

                If you have multiple threads, you almost certainly have mutexes. If your fork happens while a non-main thread holds a mutex, then in the child the main thread will never again be able to acquire that mutex.

                An imperfect solution is to require every mutex created to be accompanied by a pthread_atfork handler, but libraries don't do that unless forking is specifically requested. In other words, if you don't control the library, you can't fork.

        • By philsnow 2026-03-09 17:26, 2 replies

          Fork-then-thread works, does it not?

          • By kccqzy 2026-03-09 17:40, 1 reply

            If you have enough discipline to make sure you only create threads after all the forking is done, then sure. But having such discipline is harder than just forbidding fork or forbidding threads in your program. It turns a careful analysis of timing and causality into just banning a few functions.

            • By josefx 2026-03-09 20:40, 1 reply

              Can't you check what threads are active at the time you fork?

              • By kccqzy 2026-03-09 22:30

                And what do you do with that information? Refuse to fork after you detect more than one thread running? I haven’t seen any code that gracefully handles the unable-to-fork scenario. When people write fork-based code, especially in Python, they always expect forking to succeed.

          • By rpcope1 2026-03-09 17:30

            But not the reverse. If it's a bare fork and the code isn't strictly free of mutexes and shared resources (which is hard), there are few or no warning lights to indicate that this is a terrible idea that fails in really unpredictable and hard-to-debug ways.

    • By matsemann 2026-03-09 12:32

      For big things the current way works fine. Having a separate container/deployment for Celery, the web server, etc. is nice so you can deploy and scale them separately. Mostly it works fine, but there are of course some drawbacks: for example, Prometheus scraping of components that can't run a web server in parallel is clunky to work around.

      And for smaller projects it's such an annoyance. Having a simple project running, and having to muck around to get cron jobs, background/async tasks, etc. to work in a nice way is one of the reasons I never reach for Python in these instances. I hope removing the GIL makes it better, but I'm also afraid it will expose a whole can of worms where lots of apps, tools, and frameworks aren't written with this possibility in mind.

    • By rpcope1 2026-03-09 17:22, 1 reply

      > observability tooling for Python evolving

      As much as I dislike Java the language, this is somewhere where the difference between CPython and JVM languages (and probably BEAM too) is hugely stark. Want to know if garbage collection or memory allocation is a problem in your long running Python program? I hope you're ready to be disappointed and need to roll a lot of stuff yourself. On the JVM the tooling for all kinds of observability is immensely better. I'm not hopeful that the gap is really going to close.

      • By mike_hearn 2026-03-10 10:49

        You can run Python on the JVM and then benefit from those tools!

    • By fiedzia 2026-03-09 17:33, 1 reply

      > If true parallelism becomes common, it might actually reduce the number of containers/services needed for some workloads

      Not by much. The cases where you can replace processes with threads and save memory are rather limited.

      • By aoeusnth1 2026-03-09 17:50

        Citation needed? Tall (vertically scaled) tasks are standard practice to improve utilization and reduce hotspots by reducing load variance across tasks.

    • By apothegm 2026-03-09 13:18, 2 replies

      A lot of that has already been solved by scaling workers to cores, along with techniques like greenlets/eventlets that support concurrency without true multithreading to take better advantage of CPU capacity.

      • By kevincox 2026-03-09 15:00

        But you are still more or less limited to one CPU core per Python process. Yes, you can use that core more effectively, but you still can't scale up very effectively.

      • By Sohcahtoa82 2026-03-09 21:15, 1 reply

        That's great for concurrency, but doesn't improve parallelism.

        Unless you mean you have multiple worker processes (or GIL-free threads).

        • By apothegm 2026-03-10 11:04

          Yes, multiple worker processes is what I meant. Few web apps have a meaningful use for parallelism within a single process. So long as you’re keeping all cores busy with independent processes at high concurrency, multithreading adds relatively little.

          YMMV if you’re doing a lot of number crunching.

    • By influx 2026-03-09 19:45, 1 reply

      I would have thought most of those would have been moved to async Python by now.

      • By LtWorf 2026-03-09 20:40

        Async Python still uses a single thread for the main loop; it just hides non-blocking I/O.
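That distinction is easy to demonstrate: every coroutine scheduled by `asyncio` runs on the thread that owns the event loop, so async gives concurrency (interleaving at await points) but not parallelism.

```python
import asyncio
import threading

main_thread = threading.current_thread()
seen = []

async def task(i):
    # Every coroutine runs on the event-loop thread: concurrency, not parallelism.
    seen.append((i, threading.current_thread() is main_thread))
    await asyncio.sleep(0)  # yield to the loop so the tasks interleave

async def main():
    await asyncio.gather(*(task(i) for i in range(3)))

asyncio.run(main())
```

All three tasks report the same (main) thread, which is why async alone cannot use multiple cores for CPU-bound work.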

  • By carlsborg 2026-03-09 12:51, 2 replies

    Should have funded the entire GIL-removal effort by selling carbon credits. Here's an industry waiting to happen: issue carbon credits for optimizing CPU and GPU resource usage in established libraries.

    • By minimaxir 2026-03-09 16:45

      There's a spicy argument to be made that "Rewrite it in Rust" is actually an environmentalist approach.

    • By pradeeproark 2026-03-09 13:06, 1 reply

      I'll take all the migrations of Electron apps.

      • By GuB-42 2026-03-09 13:40, 2 replies

        I wonder about the total energy cost of apps like Teams, Slack, Discord, etc.: hundreds of millions of users, an app running constantly in the background. I wouldn't be surprised if the global power consumption on the client side reached a gigawatt. Add the increased wear on the components, the cost of hardware upgrades, etc.

        All that to avoid hiring a few developers to make optimized native clients for the most popular platforms. Popular apps and websites should lose or gain carbon credits based on optimization. What is negligible for a small project becomes important when millions of users are involved, especially for background apps.

        • By dr_zoidberg 2026-03-09 14:02, 1 reply

          If we go by Microsoft's 2020 account of 1 billion devices running Windows 10 [0], and assume all of those are running some kind of Electron app (or several), you easily get your gigawatt by saving just 1 watt per device on average. I suspect you'd probably go higher than 1 gigawatt, but I'm not sure about a full extra order of magnitude. Then again, the noisy fan on my notebook begs to differ; maybe the 10 GW mark could be doable.

          [0] https://news.microsoft.com/apac/2020/03/17/windows-10-poweri...

          • By PaulHoule 2026-03-09 14:52, 2 replies

            There are 30,000 different cross-platform GUI frameworks and they all share two attributes: (1) they look embarrassingly bad compared to Electron or native apps, and (2) they are mostly terrible to program for.

            I feel like I'm never wasting my time when I learn how to do things with the web platform, because it turns out the app I made for desktop and tablet works on my VR headset. Sure, if you pay me 2x the market rate and it's a sure thing, you might interest me in learning Swift and how to write iOS apps, but I'm not going to do it for a personal project, or even a moneymaking project where I'm taking some financial risk. No way. The price of learning how to write apps for Android is that I also have to learn how to write apps for iOS, and for Windows, and for macOS, and decide what's the least-bad widget set for Linux and learn to program for it too.

            Every time I do a shoot-out of Electron alternatives, Electron wins and it is not even close; the only real competitor is a plain ordinary web application, with or without PWA features.

            • By bigstrat2003 2026-03-09 17:16, 1 reply

              > Every time I do a shoot-out of Electron alternatives Electron wins and it is not even close

              Only if you're ok with giving your users a badly performing application. If you actually care about the user experience, then Electron loses and it's not even close.

              • By PaulHoule 2026-03-09 18:41

                Name something specific. Note, for two cross-platform UI toolkits I have some familiarity with:

                Python + tkinter == about the same size as Electron

                Java + JavaFX == about the same size as Electron

                Sure, there are still people who write little applets for software developers that are 20k Win32 applications, but that is really out of the mainstream.

            • By wiseowise 2026-03-09 15:29, 1 reply

              Many times this. Native path is the path of infinite churn, ALL the time. With web you might find some framework bro who takes pride in knowing all the intricacies of React hooks who'll grill you for not dreaming in React/Vue/framework of the day, but fundamental web skills (JS/HTML/CSS) are universal. And you can pretty much apply them on any platform:

              - iOS? React Native, Ionic, Web app via Safari

              - Android? Same thing

              - Mac, Windows, Linux – Tauri, Electron, serve it yourself

              Native? Oh boy, here we fucking go: you've spent last decade honing your Android skills? Too bad, son, time to learn Android jerkpad. XML, styles, Java? What's that, gramps? You didn't hear that everything is Kotlin now? Dagger? That's so 2025, it's Hilt/Metro/Koin now. Oh wow, you learned Compose on Android? Man, was your brain frozen for 50 years? It's KMM now, oh wait, KMM is rebranded! It's KMP now! Haha, you think you know Compost? We're going to release half baked Compost multiplatform now, which is kinda the same, but not quite. Shitty toolchain and performance worse than Electron? Can't fucking hear you over jet engine sounds of my laptop exhaust, get on my level, boy!

        • By scottcha 2026-03-09 18:59

          I actually built this analysis while I worked at Microsoft, so I 100% agree. Doing the work at the platform level is the way to go, and you can actually make a significant impact with this kind of approach. The other, less obvious value is that doing it client-side ends up touching all the grids/generators in the world, outside of the market-based accounting that tends to drive datacenter carbon-impact analysis.

  • By p_m_c 2026-03-09 15:06, 1 reply

    > Similarly, workloads where threads frequently access and modify the same objects show reduced improvements or even degradation due to lock contention.

    Perhaps I'm stating the obvious, but you deal with this with lock-free data structures, immutable data, siloing data per thread, fine-grained locks, etc.

    Basically you avoid locks as much as possible.
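A sketch of the "silo data per thread" option from the list above: each thread works on its own slice of the input and writes to its own result slot, so no locks are needed until the final (tiny) merge. The `parallel_sum` helper is illustrative, not from the paper.

```python
import threading

def parallel_sum(data, n_threads=4):
    # Silo data per thread: each thread sums its own slice with no shared
    # mutable state, avoiding the lock contention described above.
    chunks = [data[i::n_threads] for i in range(n_threads)]
    partials = [0] * n_threads

    def work(idx):
        partials[idx] = sum(chunks[idx])  # each thread writes a distinct slot

    threads = [threading.Thread(target=work, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)  # lock-free merge: threads have already joined

total = parallel_sum(list(range(10_000)))
```

Under the GIL this pattern gains nothing for pure-Python work; on the free-threaded build the siloed slices can genuinely run on separate cores.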

    • By nijave 2026-03-09 17:09, 1 reply

      It'd be nice if the Python standard library had more thread-safe primitives/structures (compared to something like Java, where there are tons of thread-safe data structures)

      Imo the GIL was used as an excuse for a long time to avoid building those out.
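`queue.Queue` is one of the few explicitly thread-safe structures the stdlib does ship; a minimal producer/consumer sketch:

```python
import queue
import threading

q = queue.Queue()  # all of Queue's methods are internally locked

def producer():
    for i in range(100):
        q.put(i)

def consumer(out):
    for _ in range(100):
        out.append(q.get())  # blocks until an item is available

collected = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(collected,))
t1.start(); t2.start()
t1.join(); t2.join()
```

Beyond `queue`, `threading` locks/events, and `concurrent.futures`, most stdlib containers offer no documented thread-safety contract, which is the gap being pointed out.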

      • By liuliu 2026-03-09 20:20

        > It'd be nice if the Python standard library had more thread-safe primitives/structures (compared to something like Java, where there are tons of thread-safe data structures)

        Hence basic Python structures under free-threaded Python are all thread-safe, which explains why they are slower than under the GIL variant.

HackerNews