Rewrite the attention kernel to be persistent. This gives better performance at low context lengths. However, fp16 at large context has suffered a bit due to a ptxas instruction scheduling issue in the so...
https://github.com/triton-lang/triton/pull/7298#discussion_r...
> By disassembly of ptxas, it is indeed hard-coded that they have logic like: strstr(kernel_name, "cutlass").
> it is likely that, this is an unstable, experimental, aggressive optimization by NVIDIA, and blindly always enabling it may produce some elusive bugs.
Often not elusive bugs, but elusive performance. GPU compilers are hard: once you've done the basics, trying to do further transforms in a mature compiler will almost always produce mixed results. Some kernels will go faster, some will go slower, and you're hoping to move the balance and not hit any critical kernel too hard in your efforts to make another go faster.
An optimization with a universal >=0 speedup across your entire suite of tests is a really hard thing to come by. Something is always going to have a negative speedup.
My experience is with non-Nvidia GPU systems, but this feels like a familiar situation. They probably found something that has great outcomes for one set of kernels, terrible outcomes for another, and no known reliable heuristic or modeling they could use to automatically choose.
A saner design would turn this optimization into a documented flag that anyone can opt into.
Speaking from a place of long-term frustration with Java, some compiler authors just absolutely hate exposing the ability to hint/force optimizations. Never mind that it might improve performance for N-5 and N+5 major releases, it might be meaningless or unhelpful or difficult to maintain in a release ten years from now, so it must not be exposed today.
I once exposed a "disableXYZOptimization" flag to customers so they could debug more easily without stuff getting scrambled. I paid for my gesture for the next year: signing off on release updates, writing user guide entries, bleh.
So it's better to hardcode your specific library name and deal with the same issue after people have reverse engineered it and started depending on it anyway?
That seems valid for customers expecting a warranty or support. But they should allow it for customers who waive all such in writing.
Warranty and support specifically for that flag? Because I don't see how general warranty and support requires keeping any hint flags forever.
Doesn't need to, it can acknowledge and ignore the hints.
True, but there can be more problems: if you drop support, their runtime will be slow because they rely on this flag, and they'll be unhappy.
The premise of removing the flag is that it's useless or a problem. If it's still causing a big speed boost somewhere then you need to figure something out, but the core scenario here is that it's obsolete.
> An optimization with a universal >=0 speedup across your entire suite of tests is a really hard thing to come by. Something is always going to have a negative speedup.
Maybe a common example of this is that people can write matrix-matrix multiplication kernels that outperform standard implementations (also in BLAS for CPU). But that's not a General Matrix Matrix multiply. Is the speedup still there for sparse matrices? Larger ones? Small ones? Ones that aren't powers of 2? Non-square? And so on. You can beat the official implementation in any one of these, but good luck doing it everywhere. In fact, you should beat the official method, because you don't have the overhead of checking which optimization to use.

It's easy to oversimplify a problem and not even realize you have done so. There are always assumptions being made, and you should not let them be invisible.
Thanks for a little context, this is not my wheelhouse at all (never even heard of this project) and I could not make heads or tails of the title or the linked PR.
Heh. Does anyone remember when, almost 25 years ago, ATI (AMD) was caught manipulating the Quake III benchmarks by renaming the executables to 'quack'?
https://web.archive.org/web/20230929180112/https://techrepor...
https://web.archive.org/web/20011108190056/https://hardocp.c...
https://web.archive.org/web/20011118183932/www.3dcenter.de/a...
Just in case anyone else parsed that sentence the same way as me: ATI detected "quake" as the executable name and changed things like texture quality etc. to increase benchmark performance. Some people discovered this after they renamed the executable to "quack" and the image quality improved but the benchmarks were lower, proving that the ATI drivers "optimised" by reducing quality.
Ati did not rename quake to quack as I originally thought from this! :)
The story was that they used a lower mipmap level (blurrier textures) when the process was named Quake, but used the normal mipmap level (standard textures) when the process was named Quack.
Thank you for explaining. I was so confused at how AMD was improving Quake performance with duck-like monikers.
Well, if it _looks_ like a high-performance texture renderer, and it _walks_ like a high-performance texture renderer...
So the additional performance came with a large bill?
Or Intel checking for "GenuineIntel" in ICC's output: https://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler#Support...
Or Win 3.1 looking for whatever shibboleth was in MS-DOS and popping up a scary-looking message if it found another DOS? https://en.wikipedia.org/wiki/AARD_code
I don’t think anybody remembers this since that code never shipped in retail.
It didn't ship in the final retail version only because the tech press of the day exposed what Microsoft had done.
It did ship in the final retail version, in a way. It was disabled, but the code was still there, and a flag was all that was needed to enable it.
Every vendor does this to this day, and it's a morally grey practice: drivers hijack and modify the rendering loops of popular games, fixing bugs, replacing shaders with more optimized versions, enabling faster codepaths in the driver, etc.
These changes are supposed to have minimal to no impact on the actual output, but sometimes vendors are really aggressive, and significantly degrade the outputs so that the game can run faster on their hardware.
Sadly it's built into the Vulkan API. Even a fully userspace driver arrangement with a microkernel ends up giving the driver access to the client's information. Of course, the way it's done it's forgeable, so you could opt out if you really wanted to.
[1]: https://github.com/KhronosGroup/Vulkan-Headers/blob/main/inc...
I mean, Khronos put that in for a reason. If drivers didn't get explicit information about the application being run, they would fall back on silly heuristics like the Quake 3 one to squeeze out performance.
> but sometimes vendors are really aggressive, and significantly degrade the outputs so that the game can run faster on their hardware.
Do you have a source for this? I’d like to see some examples
Nvidia ships a control panel with its drivers. Open it up -> Manage 3D settings -> Program Settings. Scroll through and see how every single program/game you have installed openly has different defaults in it based on application name. As someone noted above, others do the same thing.
E.g. Frostpunk has antialiasing for transparency layers on; Slay the Spire does not. I never set these settings. Nvidia literally does a lookup on first run for what they judge as the best defaults and sets them accordingly.
Every single game/program you install has different options from a huge list of possible optimizations.
Applying different standard settings is pretty different from "hijacking and modifying the rendering loop", though.
In what sense? The render loop is modified from "the" default without user or program opt-in, and "hijacking" is what it would be called if anyone but Nvidia did it, so Nvidia is not exempt from that usage. Though: runtime patch, haxie, hijack, LD_PRELOAD, system extension; the noun changes every few years, so perhaps it's time for a new one. Override?
But the comment I replied to wasn’t talking about runtime patching or any of the other settings you mentioned. It was talking about changing GPU settings for specific programs. Not changing anything about the program itself.
That’s not what @torginus was referring to. There’s nothing wrong with having and exposing application specific settings. There’s nothing wrong with drivers having application specific optimization patches either, but that’s a very different thing.
For more context and deeper discussion on the subject, see https://news.ycombinator.com/item?id=44531107
Funnily, it's under an older submission of the same cutlass optimizations.
This is weirdly common; phone chipset manufacturers did it with phone benchmarks [0], VW with emissions [1], Nvidia with 3DMark [2], Intel with the SPEC benchmark for its Xeon processors [3], etc.
When it comes to computer graphics, iirc it's pretty normalized now - graphics drivers all seem to have tweaks, settings, optimizations and workarounds for every game.
(As an aside, I hate that I have to link to archive.org; there are a lot of dead links nowadays, but these are important things to remember.)
[0] https://web.archive.org/web/20250306120819/https://www.anand...
[1] https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal
[2] https://web.archive.org/web/20051218120547/http://techreport...
[3] https://www.servethehome.com/impact-of-intel-compiler-optimi...
> When it comes to computer graphics, iirc it's pretty normalized now - graphics drivers all seem to have tweaks, settings, optimizations and workarounds for every game.
Even Mesa has them: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/uti...
> graphics drivers all seem to have tweaks, settings, optimizations and workarounds for every game.
Maybe hyperbole, but obviously they can't do this for literally every game; that would require huge personnel resources. At least looking at Mesa (linked elsewhere), only ~200 games are patched, out of what, 100k PC games? So <1%.
Goodhart's law: when a measure becomes a target, it ceases to be a good measure.
Are there any archives of that techreport article with images intact?
Ah yes, they changed the site and URL system after some years. Here is the OG version with screenshots:
Page 1 https://web.archive.org/web/20071028172853/http://techreport...
Page 2 https://web.archive.org/web/20111130162817/http://techreport...
Page 3 https://web.archive.org/web/20080213212637/http://techreport...
Page 4 https://web.archive.org/web/20101110031431/http://techreport...
Page 5 https://web.archive.org/web/20101108144857/http://techreport...
I feel like omitting AMD is relevant here for anyone who doesn't know the acquisition history of ATI: AMD had no involvement with this.
I work with compilers, and despite it not being nice, some optimizations rely on type or function name schemas/substrings/etc.
It sucks, but that's how it works.
It doesn't have to be malicious; sometimes it's just safer to deploy an optimization only for your own libs than to risk breaking stuff.
Or your frontend isn't giving you more data you can rely on.
It is probably not malicious, but it certainly does create new barriers, which is not a good thing.
Something like:

    if (AskLLM("Does function signature+name look like error handling code")) {
        TurnOffInliner();
    }

is actually probably a lot more effective than you'd think (generating PGO traces with a machine learning tool is apparently a thing that sort of works).

Not exactly the same, but intrinsics in some languages are purely name+signature matched.
Sure, but will anyone name a method like "__nvidia_experimental_feature_xyz_v1"?
Yes.
E.g. "__nvidia_experimental_feature_xyz_v1"
> than risk breaking stuff
... until somebody randomly chooses the same name for some reason and gets hosed.
You're not helping.