FP8 runs ~100 TFLOPS faster when the kernel name has "cutlass" in it

2025-10-03 4:21 · 338 points · 166 comments · github.com

Rewrite the attention kernel to be persistent. This gives better performance at low contexts. However, fp16 at large context has suffered a bit due to a ptxas instruction scheduling issue in the so...


Comments

  • By nulld3v 2025-10-03 5:41 · 2 replies

    https://github.com/triton-lang/triton/pull/7298#discussion_r...

    > By disassembly of ptxas, it is indeed hard-coded that they have logic like: strstr(kernel_name, "cutlass").

    > it is likely that, this is an unstable, experimental, aggressive optimization by NVIDIA, and blindly always enabling it may produce some elusive bugs.
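    A minimal sketch of what that disassembled check would amount to (the function name is made up; only the "cutlass" substring test is from the quote above):

```python
def aggressive_fp8_scheduling_enabled(kernel_name: str) -> bool:
    # Reportedly, ptxas gates its faster instruction scheduling on the
    # kernel's mangled name containing the substring "cutlass".
    return "cutlass" in kernel_name

# Renaming a kernel is then enough to flip the optimization on:
assert not aggressive_fp8_scheduling_enabled("triton_attn_fwd")
assert aggressive_fp8_scheduling_enabled("triton_attn_fwd_cutlass")
```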

    • By frogblast 2025-10-03 15:04 · 2 replies

      Often not elusive bugs, but elusive performance. GPU compilers are hard: once you've done the basics, further transforms in a mature compiler will almost always produce mixed results. Some kernels will go faster, some will go slower, and you're hoping to move the balance without hitting any critical kernel too hard in your effort to make another go faster.

      An optimization with a universal >=0 speedup across your entire suite of tests is a really hard thing to come by. Something is always going to have a negative speedup.

      My experience is with non-Nvidia GPU systems, but this feels like a familiar situation. They probably found something that has great outcomes for one set of kernels, terrible outcomes for another, and no known reliable heuristic or modeling they could use to automatically choose.
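      To make the trade-off concrete, a toy calculation with made-up per-kernel speedup factors: the aggregate improves even though one kernel regresses, so a ">= 0 everywhere" bar would reject the transform.

```python
import math

# Hypothetical per-kernel speedup factors after enabling one transform
speedups = {
    "gemm_fp16": 1.12,   # faster
    "attn_fp8": 1.30,    # much faster
    "softmax": 0.97,     # regression
    "layernorm": 1.00,   # unchanged
}

# Geometric mean over the suite, and the list of regressed kernels
geomean = math.prod(speedups.values()) ** (1 / len(speedups))
regressions = [name for name, s in speedups.items() if s < 1.0]

print(f"geomean speedup: {geomean:.3f}, regressions: {regressions}")
```

      The suite-wide geomean is above 1.0, yet "softmax" got slower; whether to ship the transform is then a judgment call, not a measurement.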

      • By Eridrus 2025-10-03 15:56 · 1 reply

        A saner design would turn this optimization into a documented flag that anyone can opt into.
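        Roughly what such an opt-in could look like, sketched with a hypothetical flag name (the flag and its interface are invented for illustration; nothing here is a real ptxas option):

```python
import argparse

# Sketch of a compiler driver exposing the optimization as a documented,
# explicit opt-in instead of gating it on a kernel-name substring.
parser = argparse.ArgumentParser(prog="ptxas-like-compiler")
parser.add_argument(
    "--experimental-aggressive-scheduling",
    action="store_true",
    help="opt into the unstable instruction-scheduling optimization",
)

# Off by default; users who want it ask for it by name.
args = parser.parse_args(["--experimental-aggressive-scheduling"])
```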

        • By rcoveson 2025-10-03 16:58 · 3 replies

          Speaking from a place of long-term frustration with Java, some compiler authors just absolutely hate exposing the ability to hint/force optimizations. Never mind that it might improve performance for N-5 and N+5 major releases, it might be meaningless or unhelpful or difficult to maintain in a release ten years from now, so it must not be exposed today.

          • By recursivecaveat 2025-10-03 22:18

            I once exposed a "disableXYZOptimization" flag to customers so they could debug more easily without stuff getting scrambled. Paid for my gesture for the next year signing off on release updates, writing user guide entries, bleh.

          • By Eridrus 2025-10-06 4:20

            So it's better to hardcode your specific library name and deal with the same issue after people have reverse engineered it and started depending on it anyway?

          • By MichaelZuo 2025-10-03 17:10 · 1 reply

            That seems valid for customers expecting a warranty or support. But they should allow it if customers waive all such expectations in writing.

            • By Dylan16807 2025-10-03 19:30 · 1 reply

              Warranty and support specifically for that flag? Because I don't see how general warranty and support requires keeping any hint flags forever.

              • By shadowpho 2025-10-04 0:48 · 1 reply

                If you remove the hint flag, people's builds will break.

                • By Dylan16807 2025-10-04 21:53 · 1 reply

                  Doesn't need to, it can acknowledge and ignore the hints.

                  • By shadowpho 2025-10-05 6:50 · 1 reply

                    True, but there might be more problems: if you drop support, their runtime will be slow because they rely on this flag, and they'll be unhappy.

                    • By Dylan16807 2025-10-05 13:22

                      The premise of removing the flag is that it's useless or a problem. If it's still causing a big speed boost somewhere then you need to figure something out, but the core scenario here is that it's obsolete.

      • By godelski 2025-10-04 2:51

          > An optimization with a universal >=0 speedup across your entire suite of tests is a really hard thing to come by. Something is always going to have a negative speedup.
        
        Maybe a common example of this is that people can write matrix-matrix multiplication kernels that outperform standard implementations (also in BLAS for CPU). But that's not a General Matrix Matrix multiply. Is the speedup still there for sparse matrices? Larger ones? Small ones? Ones that aren't powers of 2? Non-square? And so on. You can beat the official implementation in any one of these, but good luck doing it everywhere. In fact, you should expect to beat the official method, because you don't pay the overhead of checking which optimization to use.

        It's easy to oversimplify a problem and not even realize you have done so. There are always assumptions being made, and you should not let them be invisible.
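        That dispatch overhead can be made concrete with a pure-Python sketch; the predicates and "kernel" choices below are illustrative stand-ins, not real BLAS logic:

```python
def naive_matmul(a, b):
    # One reference kernel standing in for all the specialized variants
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

def general_matmul(a, b):
    # A general entry point must first classify the problem (sparse? odd
    # shape? non-power-of-two?) before picking a kernel -- work that a
    # single-purpose kernel tuned for one case never has to do.
    nnz = sum(x != 0 for row in a for x in row)
    if nnz * 2 < len(a) * len(a[0]):
        return naive_matmul(a, b)  # stand-in for a sparse kernel
    if len(a) != len(b[0]) or len(a) & (len(a) - 1):
        return naive_matmul(a, b)  # stand-in for the odd-shape fallback
    return naive_matmul(a, b)      # stand-in for the fast tiled kernel
```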

    • By temp0826 2025-10-03 5:48

      Thanks for the context; this is not my wheelhouse at all (never even heard of this project) and I could not make heads or tails of the title or the linked PR.

  • By haunter 2025-10-03 6:42 · 7 replies

    Heh. Does anyone remember when, almost 25 years ago, ATI (now AMD) was caught manipulating the Quake III benchmarks, exposed when the executable was renamed to ‘quack’?

    https://web.archive.org/web/20230929180112/https://techrepor...

    https://web.archive.org/web/20011108190056/https://hardocp.c...

    https://web.archive.org/web/20011118183932/www.3dcenter.de/a...

    • By mattlondon 2025-10-03 8:21 · 3 replies

      Just in case anyone else parsed that sentence the same way as me: ATI detected "quake" as the executable name and changed things like texture quality to increase benchmark performance. People discovered this after they renamed the executable to "quack" and the image quality improved while the benchmark numbers dropped, proving that the ATI drivers "optimised" by reducing quality.

      ATI did not rename Quake to Quack, as I originally thought from this! :)

      • By Dwedit 2025-10-03 15:11

        The story was that they used a lower mipmap level (blurrier textures) when the process was named Quake, but used the normal mipmap level (standard textures) when the process was named Quack.
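        A toy reconstruction of that reported behavior (the bias value is invented; only the process-name check comes from the story):

```python
def texture_mip_bias(process_name: str) -> int:
    # A positive bias selects blurrier (cheaper-to-sample) mipmap levels.
    # Reportedly triggered by the executable being named "Quake".
    return 1 if "quake" in process_name.lower() else 0

assert texture_mip_bias("Quake3.exe") == 1   # degraded textures, higher fps
assert texture_mip_bias("Quack3.exe") == 0   # normal textures
```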

      • By a_wild_dandan 2025-10-03 8:39 · 1 reply

        Thank you for explaining. I was so confused at how AMD was improving Quake performance with duck-like monikers.

        • By stavros 2025-10-03 9:49 · 2 replies

          Well, if it _looks_ like a high-performance texture renderer, and it _walks_ like a high-performance texture renderer...

          • By _joel 2025-10-03 12:18 · 1 reply

            It's probably been duck typed

          • By taneq 2025-10-03 13:54

            If it looks like a benchmark and it quacks like a benchmark… duck?

      • By robotresearcher 2025-10-03 17:16

        So the additional performance came with a large bill?

    • By a-french-anon 2025-10-03 7:56 · 1 reply

      Or Intel checking for "GenuineIntel" in ICC's output: https://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler#Support...

      • By cratermoon 2025-10-03 15:04 · 1 reply

        Or Win 3.1 looking for whatever shibboleth was in MS-DOS and popping up a scary-looking message if it found another DOS? https://en.wikipedia.org/wiki/AARD_code

        • By keanb 2025-10-03 15:10 · 1 reply

          I don’t think anybody remembers this since that code never shipped in retail.

          • By GeekyBear 2025-10-03 19:42 · 1 reply

            It was only dropped from the final retail version after the tech press of the day exposed what Microsoft had done.

            • By cratermoon 2025-10-03 19:47

              It did ship in the final retail version, in a way. It was disabled, but the code was still there, and a flag was all that was needed to enable it.

    • By torginus 2025-10-03 13:06 · 2 replies

      Every vendor does this to this day, and it's a morally grey practice: drivers hijack and modify the rendering loops of popular games, fixing bugs, replacing shaders with more optimized versions, enabling faster codepaths in the driver, etc.

      These changes are supposed to have minimal to no impact on the actual output, but sometimes vendors are really aggressive, and significantly degrade the outputs so that the game can run faster on their hardware.

      • By surajrmal 2025-10-03 14:57 · 1 reply

        Sadly it's built into the Vulkan API. Even a fully userspace driver arrangement with a microkernel ends up giving the driver access to the client's information. Of course, it's forgeable the way it's done, so you could opt out if you really wanted to.

        [1]: https://github.com/KhronosGroup/Vulkan-Headers/blob/main/inc...

        • By doubletwoyou 2025-10-03 15:54

          I mean, Khronos put that in for a reason. If drivers didn't get explicit information about the application being run, they would fall back to silly heuristics like the quake3 one to squeeze out performance.

      • By Aurornis 2025-10-03 13:22 · 1 reply

        > but sometimes vendors are really aggressive, and significantly degrade the outputs so that the game can run faster on their hardware.

        Do you have a source for this? I’d like to see some examples

        • By AnotherGoodName 2025-10-03 15:13 · 2 replies

          Nvidia ships a control panel with its drivers. Open it up -> Manage 3D settings -> Program Settings. Scroll through and see how every single program/game you have installed openly has different defaults based on the application name. As someone noted above, others do the same thing.

          E.g. Frostpunk has antialiasing for transparency layers on; Slay the Spire does not. I never set these settings. Nvidia literally does a lookup on first run for what they judge as the best defaults and sets them accordingly.

          Every single game/program you install has different options from a huge list of possible optimizations.
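          Mechanically this amounts to a per-executable profile table inside the driver. A sketch with hypothetical entries (the Frostpunk transparency-AA setting is from the example above; every other name and value is invented):

```python
# Driver-wide defaults applied when no profile matches
DEFAULTS = {"transparency_aa": False, "prerendered_frames": 3}

# Hypothetical driver-side profile database keyed by executable name
APP_PROFILES = {
    "frostpunk.exe": {"transparency_aa": True},
    "slaythespire.exe": {},  # keeps the defaults
}

def settings_for(exe_name: str) -> dict:
    # First-run lookup: defaults overridden by any app-specific profile
    return {**DEFAULTS, **APP_PROFILES.get(exe_name.lower(), {})}
```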

          • By umanwizard 2025-10-03 16:12 · 1 reply

            Applying different standard settings is pretty different from "hijacking and modifying the rendering loop", though.

            • By altairprime 2025-10-03 20:12 · 1 reply

              In what sense? The render loop is modified from “the” default without user or program opt-in, and “hijacking” is what it would be called if anyone but Nvidia did it — so Nvidia is not exempt from that use. Though: Runtime patch, haxie, hijack, LD_PRELOAD, system extension; the noun changes every few years, so perhaps it’s time for a new one. Override?

              • By umanwizard 2025-10-04 18:08

                But the comment I replied to wasn’t talking about runtime patching or any of the other settings you mentioned. It was talking about changing GPU settings for specific programs. Not changing anything about the program itself.

          • By dahart 2025-10-03 21:53

            That’s not what @torginus was referring to. There’s nothing wrong with having and exposing application specific settings. There’s nothing wrong with drivers having application specific optimization patches either, but that’s a very different thing.

    • By bayindirh 2025-10-03 9:12

      For more context and deeper discussion on the subject, see https://news.ycombinator.com/item?id=44531107

      Funnily enough, it's under an older submission about the same cutlass optimization.

    • By Cthulhu_ 2025-10-03 9:37 · 3 replies

      This is weirdly common; phone chipset manufacturers did it with phone benchmarks [0], VW with emissions [1], nVidia did it with 3DMark [2], Intel with the SPEC benchmark for its Xeon processors [3], etc.

      When it comes to computer graphics, iirc it's pretty normalized now - graphics drivers all seem to have tweaks, settings, optimizations and workarounds for every game.

      (As an aside, I hate that I have to link to archive.org; there are a lot of dead links nowadays, but these are important things to remember.)

      [0] https://web.archive.org/web/20250306120819/https://www.anand...

      [1] https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal

      [2] https://web.archive.org/web/20051218120547/http://techreport...

      [3] https://www.servethehome.com/impact-of-intel-compiler-optimi...

      • By cesarb 2025-10-03 12:59

        > When it comes to computer graphics, iirc it's pretty normalized now - graphics drivers all seem to have tweaks, settings, optimizations and workarounds for every game.

        Even Mesa has them: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/uti...

      • By darkmighty 2025-10-03 16:06 · 3 replies

        > graphics drivers all seem to have tweaks, settings, optimizations and workarounds for every game.

        Maybe that's hyperbole, but obviously they can't do this for literally every game; that would require huge personnel resources. At least looking at Mesa (linked elsewhere), only ~200 games are patched, out of what, 100k PC games? So <1%.

        • By account42 2025-10-06 14:08

          Mesa is a lot more conservative about this than the proprietary drivers.

        • By 0x457 2025-10-03 18:58

          Well, pretty much every large AAA game launch is accompanied by a GPU driver update that adds support for that game. It's in the patch notes.

      • By wat10000 2025-10-03 15:50

        Goodhart's law: when a measure becomes a target, it ceases to be a good measure.

    • By mort96 2025-10-03 9:09 · 1 reply

      Are there any archives of that techreport article with images intact?

    • By Adachi91 2025-10-03 22:16

      I feel like the AMD mention deserves a caveat here for anyone who doesn't know the acquisition history of ATI: AMD had no involvement with this.

  • By high_na_euv 2025-10-03 9:53 · 3 replies

    I work with compilers

    And despite it not being nice, some optimizations do rely on type or function name schemas/substrings/etc.

    It sucks, but that's how it works.

    It doesn't have to be malicious; sometimes it's just safer to deploy an optimization only for your own libs than to risk breaking stuff.

    Or your frontend isn't giving you any other data you can rely on.

    • By tliltocatl 2025-10-03 11:36

      It is probably not malicious, but it certainly does create new barriers, which is not a good thing.

    • By wiz21c 2025-10-03 13:11 · 3 replies

        On function types or schemas, I can understand that. But names?

      • By mhh__ 2025-10-03 13:18

        Something like:

            if(AskLLM("Does function signature+name look like error handling code")) {
                TurnOffInliner();
            }
        
        is actually probably a lot more effective than you'd think (generating PGO traces with a machine learning tool is apparently a thing that sort of works)

      • By colejohnson66 2025-10-04 1:34 · 1 reply

        Not exactly the same, but intrinsics in some languages are purely name+signature matched.

        • By high_na_euv 2025-10-06 10:48

          Sure, but will anyone name a method like "__nvidia_experimental_feature_xyz_v1"?

      • By high_na_euv 2025-10-06 10:30

        Yes.

        E.g. "__nvidia_experimental_feature_xyz_v1"

    • By Hizonner 2025-10-03 13:07

      > than risk breaking stuff

      ... until somebody randomly chooses the same name for some reason and gets hosed.

      You're not helping.

HackerNews