
An overview of recent optimizations in Python's reference counting.
It's been a while since I've written about CPython internals and its optimizations; my last article, on garbage collection, was written 8 years ago.
A lot of small optimizations have been added since then. In this article, I will highlight a new reference counting optimization that relies on static lifetime analysis.
Reference counting is the primary memory management technique used in CPython.
In short, every Python object (the actual value behind a variable) has a reference counter field that tracks how many references point to it. When an object's reference count drops to zero, the memory occupied by that object is immediately deallocated.
For hot loops, this can lead to significant overhead due to the frequent incrementing and decrementing of reference counts. The counter must be updated every time a reference to an object is created or destroyed, which hurts performance and thrashes CPU caches.
So, when you read a variable in Python, you actually write to the memory as well.
For more details, you can read my article on garbage collection.
Since Python 3.14, there is a new bytecode instruction called LOAD_FAST_BORROW.
This is an optimized instruction that avoids incrementing the reference count when loading local variables.
Let's look at this simple function:
def advance(vx, vy, steps):
    x, y = 1, 1
    for _ in range(steps):
        x *= vx
        y *= vy
    return x, y
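The disassemblies below can be reproduced with the standard dis module; the exact output depends on the interpreter version you run it under:

```python
import dis

def advance(vx, vy, steps):
    x, y = 1, 1
    for _ in range(steps):
        x *= vx
        y *= vy
    return x, y

# Prints the bytecode as compiled by the running interpreter;
# on 3.14+ you should see LOAD_FAST_BORROW variants in the loop body.
dis.dis(advance)
```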
Since we read the values of vx and vy on every iteration, their reference counts are incremented and decremented repeatedly.
Now, let's look at the bytecode output for different Python versions.
Python 3.13:
  7           RESUME                   0

  8           LOAD_CONST               1 ((1, 1))
              UNPACK_SEQUENCE          2
              STORE_FAST_STORE_FAST   52 (x, y)

  9           LOAD_GLOBAL              1 (range + NULL)
              LOAD_FAST                2 (steps)
              CALL                     1
              GET_ITER
      L1:     FOR_ITER                11 (to L2)
              STORE_FAST               5 (_)

 10           LOAD_FAST_LOAD_FAST     48 (x, vx)
              BINARY_OP               18 (*=)
              STORE_FAST               3 (x)

 11           LOAD_FAST_LOAD_FAST     65 (y, vy)
              BINARY_OP               18 (*=)
              STORE_FAST               4 (y)
              JUMP_BACKWARD           13 (to L1)

  9   L2:     END_FOR
              POP_TOP

 12           LOAD_FAST_LOAD_FAST     52 (x, y)
              BUILD_TUPLE              2
              RETURN_VALUE
Python 3.15:
  7           RESUME                   0

  8           LOAD_SMALL_INT           1
              LOAD_SMALL_INT           1
              STORE_FAST_STORE_FAST   67 (y, x)

  9           LOAD_GLOBAL              1 (range + NULL)
              LOAD_FAST_BORROW         2 (steps)
              CALL                     1
              GET_ITER
      L1:     FOR_ITER                19 (to L2)
              STORE_FAST               5 (_)

 10           LOAD_FAST_BORROW_LOAD_FAST_BORROW 48 (x, vx)
              BINARY_OP               18 (*=)
              STORE_FAST               3 (x)

 11           LOAD_FAST_BORROW_LOAD_FAST_BORROW 65 (y, vy)
              BINARY_OP               18 (*=)
              STORE_FAST               4 (y)
              JUMP_BACKWARD           21 (to L1)

  9   L2:     END_FOR
              POP_ITER

 12           LOAD_FAST_BORROW_LOAD_FAST_BORROW 52 (x, y)
              BUILD_TUPLE              2
              RETURN_VALUE
The key difference is that the LOAD_FAST instructions were replaced with LOAD_FAST_BORROW variants.
LOAD_FAST_BORROW_LOAD_FAST_BORROW is just a superinstruction that loads two variables with a single opcode.
The LOAD_FAST opcode means "load a local variable and increment its reference count".
The LOAD_FAST_BORROW opcode means the same but without incrementing the reference count.
This optimization can only be applied when the following criteria are met:
/*
* Strength reduce LOAD_FAST{_LOAD_FAST} instructions into faster variants that
* load borrowed references onto the operand stack.
*
* This is only safe when we can prove that the reference in the frame outlives
* the borrowed reference produced by the instruction. We make this tractable
* by enforcing the following lifetimes:
*
* 1. Borrowed references loaded onto the operand stack live until the end of
* the instruction that consumes them from the stack. Any borrowed
* references that would escape into the heap (e.g. into frame objects or
* generators) are converted into new, strong references.
*
* 2. Locals live until they are either killed by an instruction
* (e.g. STORE_FAST) or the frame is unwound. Any local that is overwritten
* via `f_locals` is added to a tuple owned by the frame object.
*
* To simplify the problem of detecting which supporting references in the
* frame are killed by instructions that overwrite locals, we only allow
* borrowed references to be stored as a local in the frame if they were passed
* as an argument. {RETURN,YIELD}_VALUE convert borrowed references into new,
* strong references.
*
* Using the above, we can optimize any LOAD_FAST{_LOAD_FAST} instructions
* that meet the following criteria:
*
* 1. The produced reference must be consumed from the stack before the
* supporting reference in the frame is killed.
*
* 2. The produced reference cannot be stored as a local.
*
* We use abstract interpretation to identify instructions that meet these
* criteria. For each basic block, we simulate the effect the bytecode has on a
* stack of abstract references and note any instructions that violate the
* criteria above. Once we've processed all the instructions in a block, any
* non-violating LOAD_FAST{_LOAD_FAST} can be optimized.
*/
Which, in simpler terms, means:
For those who are familiar with Rust lifetimes, this will sound very similar. Python implements a basic version of lifetime analysis and immutable borrowing. To do so, CPython uses control flow graphs to analyze the lifetime of local variables at compile time.
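As a rough illustration, here is a toy sketch of the per-block analysis (hypothetical code, not CPython's actual C implementation): simulate the bytecode's effect on an abstract stack and flag any load whose supporting local is overwritten while the borrowed value is still on the stack.

```python
# Toy sketch of the abstract interpretation pass (hypothetical, not
# CPython's real code).  It walks one basic block, pushes an abstract
# reference for every LOAD_FAST, and flags a load as unsafe if the
# local backing it is killed while the borrow still lives on the stack.

def find_borrowable_loads(block):
    """block: list of (opname, arg) pairs for a single basic block.
    Returns the indices of LOAD_FAST instructions that may borrow."""
    stack = []                      # abstract stack of (load_index, local)
    unsafe = set()
    for i, (op, arg) in enumerate(block):
        if op == "LOAD_FAST":
            stack.append((i, arg))
        elif op == "STORE_FAST":
            stack.pop()             # the stored value is consumed
            for idx, local in stack:
                if local == arg:    # its backing local is being killed
                    unsafe.add(idx) # while the borrow is still live
        elif op == "BINARY_OP":
            stack.pop()
            stack.pop()
            stack.append((i, None)) # result is a new, strong reference
        elif op == "RETURN_VALUE":
            stack.pop()
    loads = {i for i, (op, _) in enumerate(block) if op == "LOAD_FAST"}
    return loads - unsafe
```

Run over the loop body of advance (load x, load vx, multiply, store x), this toy pass reports both loads as borrowable, because BINARY_OP consumes them before STORE_FAST kills x.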
In my code example, you may notice that the x and y variables are also borrowed, even though they are reassigned on every iteration.
This is safe because each borrowed value is consumed by BINARY_OP before the corresponding STORE_FAST overwrites the local.
The borrow only lives long enough to load the value for the multiplication.
When borrowing happens, the reference pushed onto the stack is tagged as "borrowed", so that reference counting operations on it become no-ops.
This matters because LOAD_FAST_BORROW does not increment the reference count: the decrement that many opcodes perform when they consume a value from the stack would otherwise destroy the object prematurely.
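Conceptually, the interpreter marks borrowed stack references with a tag bit so that closing a stack reference skips the decrement for them. A toy model of the idea (hypothetical Python; CPython's real representation is a tagged pointer in C):

```python
BORROWED = 1  # tag bit marking a reference that was never incremented

class Obj:
    """Stand-in for a Python object with an explicit refcount field."""
    def __init__(self):
        self.refcount = 1

def load_fast(obj):
    obj.refcount += 1          # LOAD_FAST: take a new, strong reference
    return (obj, 0)

def load_fast_borrow(obj):
    return (obj, BORROWED)     # LOAD_FAST_BORROW: no increment

def close_ref(ref):
    obj, tag = ref
    if not tag & BORROWED:     # the matching decrement is skipped
        obj.refcount -= 1      # for borrowed references
```

A borrowed reference can thus flow through the stack and be closed without ever touching the object's counter.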
There is also an ongoing effort to eliminate redundant reference counting in JIT-compiled code. Unlike the bytecode optimization described above, the current JIT implementation only optimizes loops and is disabled by default. When the JIT is enabled, the analysis of a specific loop is triggered only after 4000 iterations of it.
This seems like a pretty big deal that should have gotten more attention in the 3.14 release. It's fairly buried in the "What's New" document, but apparently the original issue is https://github.com/python/cpython/issues/130704 and the corresponding PR is https://github.com/python/cpython/pull/130708 .
Has improving CPython performance become a huge focus in just the past five years or so, or is that just a perception issue on my end?
A developer wrote a paper in 2020 about how to make Python substantially (5x) faster.
Microsoft then funded a 'Faster CPython' team that included Guido, to realize that goal. They disbanded the team in June.
https://devblogs.microsoft.com/python/python-311-faster-cpyt...
Because Python has been such a cornerstone in AI circles, performance improvements have gotten a lot more attention in the past few years.
Is python performance actually relevant to AI use cases? It's my understanding that all of the actual number crunching is done with native code.
Sure, but that doesn't mean there's no ergonomic benefits to the surrounding code being faster. Plus there's just more eyes and attention on the ecosystem.
Guido gave up the idea of keeping CPython interpreter simple. It's now a complex beast with JIT and a lot of small optimizations.