Don't "optimize" conditional moves in shaders with mix()+step()

2025-02-09 12:42 | iquilezles.org


In this article I want to correct a popular misconception that's been making the rounds in computer graphics aficionado circles for a long time now. It has to do with branching in GPUs. Unfortunately there are a couple of educational websites out there that are spreading some misinformation, and it would be nice to correct that. I tried contacting the authors without success, so without further ado, here goes my attempt to fix things up. So, say I have this code, which I actually published the other day:

vec2 snap45( in vec2 v )
{
    vec2 s = sign(v);
    float x = abs(v.x);
    return x>0.923880 ? vec2(s.x,0.0) :
           x>0.382683 ? s*sqrt(0.5)   :
                        vec2(0.0,s.y);
}


The exact details of what it does don't matter for this discussion. All we care about is the two ternary operations, which, as you know, implement conditional execution. Indeed, depending on the value of the variable x, the function will return different results. This could also be implemented with regular if statements, and everything I'm going to say stays the same. But here's the problem: when seeing code like this, somebody somewhere will invariably propose the following "optimization", which replaces what they believe (erroneously) are "conditional branches" with arithmetic operations. They will suggest something like this:

vec2 snap45( in vec2 v )
{
    vec2 s = sign(v);
    float x = abs(v.x);
    float w0 = step(0.92387953,x);
    float w1 = step(0.38268343,x)*(1.0-w0);
    float w2 = 1.0-w0-w1;
    vec2 res0 = vec2(s.x,0.0);
    vec2 res1 = vec2(s.x,s.y)*sqrt(0.5);
    vec2 res2 = vec2(0.0,s.y);
    return w0*res0 + w1*res1 + w2*res2;
}


There are two things wrong with this practice. The first one shows an incorrect understanding of how the GPU works. In particular, the original shader code had no conditional branching in it. Selecting between a few registers with a ternary operator or with a plain if statement does not lead to conditional branching; all it involves is a conditional move (a.k.a. "select"), which is a simple instruction to route the correct bits to the destination register. You can think of it as a bitwise AND+NAND+OR on the source registers, which is a simple combinational circuit. Again, there is no branching - the instruction pointer isn't manipulated, there's no branch prediction involved, no instruction cache invalidation, no nothing. For the record, of course real branches do happen in GPU code, but those are not what the GPU uses for small moves between registers like we have here. This is true for any GPU made in the last 20+ years. While I'm not an expert in CPUs, I am pretty sure this is true for them as well.

The second thing wrong with the supposedly optimized version is that it actually runs much slower than the original version. The reason is that the step() function is actually implemented like this:

float step( float x, float y )
{
    return x < y ? 1.0 : 0.0;
}


So people using the step() "optimization" are using the ternary operation anyway; it produces the 0.0 or 1.0 that they then use to select the output, wasting two multiplications and one or two additions along the way. The values could have been conditionally moved directly, which is what the original shader code did. But don't take my word for it; let's look at the generated machine code for the relevant part of the shader I published:
GLSL

return x>0.923880?vec2(s.x,0.0): x>0.382683?s*sqrt(0.5): vec2(0.0,s.y);

AMD Compiler

s_mov_b32     s0, 0x3ec3ef15
v_mul_f32     v3, 0x3f3504f3, v1
v_mul_f32     v4, 0x3f3504f3, v0
s_mov_b32     s1, 0x3f6c835e
v_cmp_gt_f32  vcc, abs(v2), s0
v_cndmask_b32 v3, 0, v3, vcc
v_cndmask_b32 v0, v0, v4, vcc
v_cmp_ngt_f32 vcc, abs(v2), s1
v_cndmask_b32 v1, v1, v3, vcc
v_cndmask_b32 v0, 0, v0, vcc

Microsoft Compiler

lt   r0.xy, l(0, 0), v0.xy
lt   r0.zw, v0.xy, l(0, 0)
iadd r0.xy, -r0.xyxx, r0.zwzz
itof r0.xy, r0.xyxx
mul  r1.xyzw, r0.xyxy, l4(0.707107)
lt   r2.xy, l(0.923880,0.382683), |v0.xx|
mov  r0.z, l(0)
movc r1.xyzw, r2.yyyy, r1.xyzw, r0.zyzy
movc o0.xyzw, r2.xxxx, r0.xzxz, r1.xyzw

Here we can see that the GPU is not branching. Instead, according to the AMD compiler, it's performing the required comparisons (v_cmp_gt_f32 and v_cmp_ngt_f32 - cmp=compare, gt=greater than, ngt=not greater than), and then using the result to mask the values with the bitwise operations mentioned earlier (v_cndmask_b32 - cnd=conditional).

The Microsoft compiler has expressed the same idea/implementation in a different format, but you can still see the comparison (lt = less than) and the masking or conditional move (movc - mov=move, c=conditionally).

Not related to the discussion, but also note that the abs() call does not become a GPU instruction and instead becomes an instruction modifier, which is free.

So, if you ever see somebody proposing this

float a = mix( b, c, step( y, x ) );

as an optimization to

float a = x < y ? b : c;

then please correct them for me. The misinformation has been around for 20 years / 10 GPU generations, and that's far too long. Thanks!


Comments

  • By quuxplusone 2025-02-09 14:14 | 3 replies

    I'm sure TFA's conclusion is right; but its argument would be strengthened by providing the codegen for both versions, instead of just the better version. Quote:

    "The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower than the original version [...] wasting two multiplications and one or two additions. [...] But don't take my word for it, let's look at the generated machine code for the relevant part of the shader"

    —then proceeds to show only one codegen: the one containing no multiplications or additions. That proves the good version is fine; it doesn't yet prove the bad version is worse.

    • By azeemba 2025-02-09 14:45 | 4 replies

      The main point is that the conditional didn't actually introduce a branch.

      Showing the other generated version would only show that it's longer. It is not expected to have a branch either, so I don't think it would have added much value.

      • By comex 2025-02-09 21:58 | 1 reply

        But it's possible that the compiler is smart enough to optimize the step() version down to the same code as the conditional version. If true, that still wouldn't justify using step(), but it would mean that the step() version isn't "wasting two multiplications and one or two additions" as the post says.

        (I don't know enough about GPU compilers to say whether they implement such an optimization, but if step() abuse is as popular as the post says, then they probably should.)

        • By MindSpunk 2025-02-10 1:10 | 2 replies

          Okay but how does this help the reader? If the worse code happens to optimize to the same thing it's still awful and you get no benefits. It's likely not to optimize down unless you have fast-math enabled because the extra float ops have to be preserved to be IEEE754 compliant

          • By account42 2025-02-10 9:11

            Fragment and vertex shaders generally don't target strict IEEE754 compliance by default. Transforming a * (b ? 1.0 : 0.0) into b ? a : 0.0 is absolutely something you can expect a shader compiler to do - that only requires assuming a is not NaN.

          • By burnished 2025-02-10 2:46 | 2 replies

            ..how is it awful if it has the same result?

            • By dcrazy 2025-02-10 2:50

              Because it perpetuates a misconception and is harder to read.

            • By seba_dos1 2025-02-10 2:51

              Just look at it.

      • By idunnoman1222 2025-02-09 15:33 | 2 replies

        Unless you’re writing an essay on why you’re right…

        • By chrisjj 2025-02-09 16:22

          > Unless you’re writing an essay on why you’re right…

          He's writing an essay on why they are wrong.

          "But here's the problem - when seeing code like this, somebody somewhere will invariably propose the following "optimization", which replaces what they believe (erroneously) are "conditional branches" by arithmetical operations."

          Hence his branchless codegen samples are sufficient.

          Further, regarding the side issue "The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower", no amount of codegen is going to show lower /speed/.

        • By ncruces 2025-02-09 16:30

          The other either optimizes the same, or has an additional multiplication, and it's definitely less readable.

      • By TheRealPomax 2025-02-09 18:12

        Correct: it would show proof instead of leaving it up to the reader to believe them.

      • By Lockal 2025-02-09 23:17

        You missed the second part, where the article says that it "actually runs much slower than the original version", "wasting two multiplications and one or two additions", based on the idea that the compiler is unable to do a very basic optimization, implying that the compiler will actually multiply by one. No benchmarks, no checking assembly, just straightforward misinformation.

    • By stevemk14ebr 2025-02-09 16:18 | 2 replies

      There are 10 types of people in this world. Those who can extrapolate from missing data, and

      • By account42 2025-02-10 9:22

        Making assumptions about performance when you can measure is generally not a good idea.

      • By robertlagrant 2025-02-10 11:11

        and what? AND WHAT?

  • By alkonaut 2025-02-09 14:56 | 11 replies

    I wish there was a good way of knowing when an if forces an actual branch and when it doesn't. The reason people do potentially more expensive mix/lerps is that, while it might cost a tiny overhead, they are scared of making it a branch.

    I do like that the most obvious v = x > y ? a : b; actually works, but it's also concerning that we have syntax where an if is sometimes a branch and sometimes not. In a context where you really can't branch, you'd almost like branch-if and non-branching-if to be different keywords. The non-branching one would fail compilation if the compiler couldn't do it without branching. The branching one would warn if it could be done without branching.

    • By pandaman 2025-02-09 18:41 | 4 replies

      >The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.

      And the reason for that is the confusing documentation from NVidia and its cg/CUDA compilers. I believe they did not want to scare programmers at first and hid the execution model, talking about "threads" and then they kept using that abstraction to hype up their GPUs ("it has 100500 CUDA threads!"). The result is people coding for GPUs with some bizarre superstitions though.

      You actually want branches in the code. Those are quick. The problem is that you cannot have a branch off a SIMD way, so instead of a branch the compiler will emit code for both branches and the results will be masked out based on the branch's condition.

      So, to answer your question - any computation based on shader inputs (vertices, compute shader indices and what not) cannot and won't branch. It will all be executed sequentially with masking. Even in the TFA example, both values of the ? operator are computed, and the same happens with any conditional on a SIMD value. There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value, but in the general case everything will be computed for every condition being true as well as being false.

      Only conditionals based on scalar registers (shader constants/unform values) will generate branches and those are super quick.

      • By account42 2025-02-10 9:26 | 1 reply

        > So, to answer your question - any computation based on shader inputs (vertices, computer shader indices and what not) cannot and won't branch.

        It can do an actual branch if the condition ends up the same for the entire workgroup - or to be even more pedantic, for the part of the workgroup that is still alive.

        You can also check that explicitly to e.g. take a faster special case branch if possible for the entire workgroup and otherwise a slower general case branch but also for the entire workgroup instead of doing both and then selecting.

        • By pandaman 2025-02-10 12:18

          And this is why I wrote "There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value but in general case everything will be computed for every condition being true as well as being false".

      • By ribit 2025-02-11 14:45 | 1 reply

        Execution with masking is pretty much how branching works on GPUs. What's more relevant, however, is that conditional statements add overhead in terms of additional instructions and execution state management. Eliminating small branches using conditional moves or manual masking can be a performance win.

        • By pandaman 2025-02-12 0:34

          No, branching works on GPUs just like everywhere else - the instruction pointer gets changed to another value. But you cannot branch on a vector value unless every element of the vector is the same, which is why branching on vector values is a bad idea. However, if your vectorized computation is naturally divergent then there is no way around it; conditional moves are not going to help, as they also will evaluate both sides of a conditional. The best you can do is to arrange it in such a way that you only add computation instead of alternating it, i.e. you do if() ... instead of if() ... else ...; then you only take as long as the longest path.

          This reminds me that people who believe that GPU is not capable of branches do stupid things like writing multiple shaders instead of branching off a shader constant e.g. you have some special mode, say x-ray vision, in a game and instead of doing a branch in your materials, you write an alternative version of every shader.

      • By ryao 2025-02-09 23:46 | 1 reply

        You can always have the compiler dump the assembly output so you can examine it. I suspect few do that.

        • By vanderZwan 2025-02-10 8:29 | 2 replies

          Does this also apply to shaders? And is it even useful, given the enormous variation in hardware capabilities out there? My impression was that it's all JIT compiled unless you know which hardware you're targeting, e.g. Valve precompiling highly optimized shaders for the Steam Deck.

          (I'm not a graphics programmer, mind you, so please correct any misunderstandings on my end)

          • By swiftcoder 2025-02-10 9:25 | 1 reply

            It's all JIT'd based on the specific driver/GPU, but the intermediate assembly language is sufficient to inspect things like branches and loop unrolling.

            • By grg0 2025-02-11 3:22

              Not really. DXIL in particular will still have branches and not care much about unrolling. You need to look at the assembly that is generated. And yes, that depends on the target hardware and compiler/driver.

          • By account42 2025-02-10 9:32

            You will have to check for the different GPUs you are targeting. But GPU vendors don't start from scratch for each hardware generation, so you will often see similar results.

      • By torginus 2025-02-10 19:20 | 1 reply

        I'll comment this here as I got downvoted when I made the point in a standalone comment - this is mostly an academic issue, since you don't want to use step or pixel-level if statements in your shader code, as they will lead to ugly aliasing artifacts as the pixel color transitions from a to b.

        What you want is to use smoothstep, which blends a bit between the two values, and for that you need to compute both paths anyway.

        • By pandaman 2025-02-11 0:03 | 1 reply

          It's absurd to claim that you'd never use step(), even in pixel shaders (there are all kinds of shaders not related to pixels at all).

          • By torginus 2025-02-11 8:58 | 1 reply

            >since you don't want to use step or pixel-level if statements in your shader code

            The observation relates to pixel shaders, and even within that, it relates to values that vary based on pixel-level data. In these cases having if statements without any sort of interpolation introduces aliasing, which tends to look very noticeable.

            Now you might be fine with that, or have some way of masking it, so it might be fine in your use case, but in the most common, naive case the issue does show up.

            • By pandaman 2025-02-11 12:37 | 1 reply

              I don't know how many graphics products you shipped and when, but, say, clamping values at 0 is pretty common even in most basic shaders. It's not magic and won't introduce "aliasing" just for the fact of using it. On the other hand, for example, using negative dot products in your lighting computation will introduce bizarre artifacts. And yes, everyone uses various forms of MSAA for the past 15 years or so even in games. Welcome to the 21st century.

              • By torginus 2025-02-11 14:31

                The way you write seems to imply you have professional experience in the matter, which makes it very strange you're not getting what I'm writing about.

                Nobody ever talked about clamping - and it's not even relevant to the discussion, as it doesn't introduce a discontinuity that can cause aliasing.

                What I'm referring to is shader aliasing, which MSAA does nothing about - MSAA is for geometry aliasing.

                To illustrate what I'm talking about with, an example that draws a red circle on a quad:

                The bad version:

                    gl_FragColor = vec4(vec3(1.0 - step(0.25, distance(vUv, vec2(0.5)))) * vec3(1.0, 0.0, 0.0), 1.0);
                
                The good version:

                    gl_FragColor = vec4(vec3(1.0 - smoothstep(0.24, 0.25, distance(vUv, vec2(0.5)))) * vec3(1.0, 0.0, 0.0), 1.0);
                
                
                The first version has a hard boundary for the circle which has an ugly aliased and pixelated contour, while the latter version smooths it. This example might not be egregious, but this can and does show up in some circumstances.

    • By ajross 2025-02-09 15:51 | 1 reply

      > it's also concerning that we have syntax where an if is some times a branch and some times not.

      That's true on scalar CPUs too though. The CMOV instruction arrived with the P6 core in 1995, for example. Branches are expensive everywhere, even in scalar architectures, and compilers do their best to figure out when they should use an alternative strategy. And sometimes get it wrong, but not very often.

      • By masklinn 2025-02-09 17:58

        For scalar CPUs, historically CMOV used to be relatively slow on x86, and notably for reliable branching patterns (>75% reliable) branches could be a lot faster.

        cmov also has dependencies on all three inputs, so if there's a high level of bias towards the unlikely input having a much higher latency than the likely one a cmov can cost a fair amount of waiting.

        Finally, cmov was absolutely terrible on P4 (10-ish cycles), and it's likely that a lot of its lore dates back to that.

    • By account42 2025-02-10 9:49

      You got this the wrong way around: For GPUs conditional moves are the default and real branches are a performance optimization possible only if the branch is uniform (=same side taken for the entire workgroup).

    • By mpreda 2025-02-09 18:12 | 1 reply

      Exactly. Consider this example:

        a = f(z);
        b = g(z);
        v = x > y ? a : b;
      
      Assuming computing the two function calls f() and g() is relatively expensive, it becomes a trade-off whether to emit conditional code or to compute both followed by a select. So it's not a simple choice, and the decision is made by the compiler.

      • By dragontamer 2025-02-09 19:40 | 2 replies

        This is a GPU focused article.

        The GPU will almost always execute both f and g, because of how GPUs differ from CPUs.

        You can avoid executing both f and g if you can ensure a scalar boolean / if statement that is consistent across the warp. So it's not 'always', but it requires incredibly specific coding patterns to 'force' the optimizer + GPU compiler into making the branch.

        • By justsid 2025-02-09 21:19

          It depends. If the code flow is uniform for the warp, only one side of the branch needs to be evaluated. But you could still end up with pessimistic register allocation because the compiler can’t know it is uniform. It’s sometimes weirdly hard to reason about how exactly code will end up executing on the GPU.

        • By danybittel 2025-02-10 5:09 | 1 reply

          f or g may have side effects too. Like writing to memory. Now a conditional has a different meaning.

          You could also have some fun stuff where f and g return a boolean, because, thanks to short-circuit evaluation, && and || are actually also conditionals in disguise.

          • By account42 2025-02-10 9:38

            Side effects will be masked, the GPU is still executing exactly the same code for the entire workgroup.

    • By plagiarist 2025-02-09 16:34 | 1 reply

      I think that capability in the shader language would be interesting to have. One might even want it to two-color all functions in the code. Anything annotated nonbranching must have if statements compile down to conditional moves and must only call nonbranching functions.

      • By catlifeonmars 2025-02-09 17:36

        This is also very relevant for cryptography use cases, where branching is a potential side channel for leaking secret information.

    • By grg0 2025-02-11 3:19

      The good way of knowing is to look at the assembly generated by the compiler. Maybe not a completely satisfying answer given that the result is heavily vendor-dependent, but unless the high-level language exposes some way of explicitly controlling it, then assembly is what you got.

      > The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.

      This is a problem, though. People shouldn't do things potentially, they should look at the actual code that is generated and executed.

    • By mwkaufma 2025-02-09 19:44

      One can do precisely what's done in the article -- inspect the assembly.

    • By chrisjj 2025-02-09 16:30 | 1 reply

      The good way is to inspect the code :)

      > it's also concerning that we have syntax where an if is some times a branch and some times not.

      It would be more concerning if we didn't. We might get a branch on one GPU and none on another.

      • By phkahler 2025-02-09 17:51 | 1 reply

        >> The good way is to inspect the code :)

        The best way is to profile the code. Time is what we are after, so measure that.

    • By nice_byte 2025-02-09 16:50

      Godbolt has the RGA compiler now; you can always paste in HLSL and look at the actual RDNA instructions that are generated (what the GPU actually runs, not SPIR-V)

    • By NohatCoder 2025-02-09 20:12 | 1 reply

      But you don't generally need to care if the shader code contains a few branches; modern GPUs handle those reasonably well, and the compiler will probably make a reasonable guess about what is fastest.

      • By account42 2025-02-10 9:42 | 1 reply

        You do need to care about large non-uniform branches as in the general case the GPU will have to execute both sides.

        • By NohatCoder 2025-02-10 16:51

          A non-branching version of the same algorithm will also run code equivalent to both branches. The branching version may sometimes skip one of the branches, the non-branching version can't. So if the functionality you want is best described by a branch, then use a branch.

  • By nosferalatu123 2025-02-10 4:19 | 2 replies

    A lot of the myth that "branches are slow on GPUs" comes from way back on the PlayStation 3, where they were quite slow. NVIDIA's RSX GPU was in the PS3; it was documented as six cycles IIRC, but it always measured slower than that to me. That was even for a completely coherent branch, where all threads in the warp took the same path. Incoherent branches were slower because the IFEH instruction took six cycles and the GPU would have to execute both sides of the branch. I believe that was the origin of the "branches are slow on GPUs" myth that continues to this day. Nowadays GPU branching is quite cheap, especially coherent branches.

    • By dahart 2025-02-10 15:54

      If someone says branching without qualification, I have to assume it’s incoherent. The branching mechanics might have lower overhead today, but the basic physics of the situation is that throughput on each side of the branch is reduced to the percentage of active threads. If both sides of a branch are taken, and both sides are the same instruction length, the average perf over both sides is at least cut in half. This is why the belief that branches are slow on GPUs is both persistent and true. And this is why it’s worth trying harder to reformulate the problem without branching, if possible.

    • By nice_byte 2025-02-10 7:01

      coherent branches are "free" but the extra instructions increase register pressure. that's the main reason why dynamic branches are avoided, not that they are inherently "slow".

HackerNews