
This one is going to be a quick one as there wasn't anything new discovered. In fact, I feel quite dumb. This is really a tale of "Do your research before acting and know what your goal is," as you'll end up saving yourself a lot of time. Nobody likes throwing away work they've done either, and there could be something here that is valuable for someone else.
I still can't escape PSRayTracing. No matter how hard I try to shelve that project, every once in a while I hear about something "new" and then wonder "can I shove this into the ray tracer and wring a few more seconds of speed out of it?" This time around it was Padé Approximants. The hope was that they'd provide faster (and better) trig approximations.
Short answer: "no". It did not help.
But... I found something that ended up making the ray tracer significantly faster!!
In any graphics application, trigonometric functions are frequently used, but they can be a little expensive in terms of computational time. While it's nice to be accurate, we usually care more about speed. So if we can find an approximation that's "good enough" and speedier than the real thing, it's generally okay to use it instead.
When it comes to texturing objects in PSRayTracing (which is based on the Ray Tracing in One Weekend books, circa the 2020 edition), the std::asin() function is used. When profiling some of the scenes, I noticed that a significant number of calls were made to that function, so I thought it was worth trying to find an approximation.
In the end, I wrote my own Taylor-series-based approximation. It was faster, but it had a flaw: whenever the input x was less than -0.8 or greater than 0.8, it would deviate heavily. So to look correct, it had to fall back to std::asin() past those bounds.
double _asin_approx_private(const double x)
{
    // This uses a Taylor series approximation.
    // See: http://mathforum.org/library/drmath/view/54137.html
    //
    // In the case where x=[-1, -0.8) or (0.8, 1.0] there is unfortunately a lot of
    // error compared to actual arcsin, so for this case we actually use the real function
    if ((x < -0.8) || (x > 0.8))
        return std::asin(x);

    // The Taylor series approximation
    constexpr double a = 0.5;
    constexpr double b = a * 0.75;
    constexpr double c = b * (5.0 / 6.0);
    constexpr double d = c * (7.0 / 8.0);

    const double aa = (x * x * x) / 3.0;
    const double bb = (x * x * x * x * x) / 5.0;
    const double cc = (x * x * x * x * x * x * x) / 7.0;
    const double dd = (x * x * x * x * x * x * x * x * x) / 9.0;

    return x + (a * aa) + (b * bb) + (c * cc) + (d * dd);
}
After a bit of trial and error, I found that a fourth-order Taylor series was the most performant on my hardware. It was measurably faster (by about 5%), so I kept it and moved on to the next optimization.

I can't remember where I heard about this one... I'm drawing a complete blank. If you want a more in-depth read, check the Wikipedia article. But in a quick nutshell, they are a mathematical tool that can help provide an approximation of an existing function. To compute one, you do need to start out with a Taylor (or Maclaurin) series. While PSRayTracing is mainly a C++ project, Python is going to be used for simplicity's sake; we'll go back to our favorite compiled language when it matters though.
For the arcsine approximation above, using the four-term Taylor, we have this in Python:
def taylor_fourth_order(x: float) -> float:
return x + (x**3)/6 + (3*x**5)/40 + (5*x**7)/112
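To quantify the deviation mentioned earlier (my own quick check, not from the original post): the truncated series is extremely accurate in the middle of the domain but falls apart near the ends, which is why the fallback to std::asin() was needed.

```python
import math

def taylor_fourth_order(x: float) -> float:
    return x + (x**3)/6 + (3*x**5)/40 + (5*x**7)/112

# Near the center, the series is within ~1e-4 of the real function...
mid_err = abs(taylor_fourth_order(0.5) - math.asin(0.5))

# ...but near the edge it is off by several hundredths of a radian
edge_err = abs(taylor_fourth_order(0.95) - math.asin(0.95))
```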
Converting that Taylor series, we get what's known as a [3/4] Padé Approximant:
def asin_pade_3_4(x):
a1 = -367.0 / 714.0
b1 = -81.0 / 119.0
b2 = 183.0 / 4760.0
n = 1.0 + (a1 * x**2)
d = 1.0 + (b1 * x**2) + (b2 * x**4)
return x * (n / d)
I'm also going to provide the one derived from a 5th-order Taylor series, a [5/4] Padé Approximant:
def asin_pade_5_4(x):
a1 = -1709.0 / 2196.0
a2 = 69049.0 / 922320.0
b1 = -2075.0 / 2196.0
b2 = 1075.0 / 6832.0
n = 1.0 + (a1 * x**2) + (a2 * x**4)
d = 1.0 + (b1 * x**2) + (b2 * x**4)
return x * (n / d)
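If you want to re-derive those [3/4] coefficients yourself, you can match terms of (1 + a1·t)/(1 + b1·t + b2·t²) against the Maclaurin series of asin(x)/x in t = x², then solve the small linear system. Here's a sketch of that derivation (my own, using exact rational arithmetic from Python's fractions module):

```python
from fractions import Fraction as F

# Maclaurin series of asin(x)/x in t = x^2:  1 + c1*t + c2*t^2 + c3*t^3 + ...
c1, c2, c3 = F(1, 6), F(3, 40), F(5, 112)

# [1/2] Pade in t (i.e. [3/4] in x):  (1 + a1*t) / (1 + b1*t + b2*t^2).
# Multiplying through by the denominator and matching coefficients gives:
#   t^1:  a1 = c1 + b1
#   t^2:   0 = c2 + c1*b1 + b2
#   t^3:   0 = c3 + c2*b1 + c1*b2
b1 = (c1 * c2 - c3) / (c2 - c1 * c1)
b2 = -c2 - c1 * b1
a1 = c1 + b1

print(a1, b1, b2)  # -367/714 -81/119 183/4760
```

Those are exactly the constants used in asin_pade_3_4 above.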
Now when charting all three, we get this:

Wow, that already looks much better. Less error! It's a bit hard to see, so let's zoom in on the right side of the functions:

The error hasn't fully gone away, but it's much less than before. Instead of defaulting back to the built-in asin() method, there's a better trick up our sleeve: leveraging Inverse Trig Functions/Half Angle Transforms. Look at this:

This does seem a tad confusing, yet it lets us do a pro gamer move. When |x| is past a threshold, we can "teleport" from the edge of the function towards the center of arcsin(), perform the computation there, and then map the result back. In Python, this is our new asin(x) approximation:
def asin_pade_3_4_half_angle_correction(x: float) -> float:
abs_x = abs(x)
# If past the range, then we can use the half angle transformation to account for error
if abs_x > 0.85:
small_x = math.sqrt(0.5 * (1.0 - abs_x))
r = (math.pi / 2) - (2.0 * asin_pade_3_4(small_x))
return -r if x < 0 else r
# Within the border we can just use the 3/4 approximation like normal
return asin_pade_3_4(x)
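As a quick sanity check (my own, repeating the two functions so the snippet runs on its own): the half-angle identity itself is mathematically exact, and the corrected approximation stays within about 5×10⁻³ of math.asin over the whole domain.

```python
import math

def asin_pade_3_4(x: float) -> float:
    a1 = -367.0 / 714.0
    b1 = -81.0 / 119.0
    b2 = 183.0 / 4760.0
    return x * (1.0 + a1 * x**2) / (1.0 + b1 * x**2 + b2 * x**4)

def asin_pade_3_4_half_angle_correction(x: float) -> float:
    abs_x = abs(x)
    if abs_x > 0.85:
        small_x = math.sqrt(0.5 * (1.0 - abs_x))
        r = (math.pi / 2) - (2.0 * asin_pade_3_4(small_x))
        return -r if x < 0 else r
    return asin_pade_3_4(x)

# The identity asin(x) = pi/2 - 2*asin(sqrt((1 - x)/2)) is exact for x in [0, 1]
for x in (0.85, 0.9, 0.99, 1.0):
    exact = math.pi / 2 - 2.0 * math.asin(math.sqrt(0.5 * (1.0 - x)))
    assert abs(math.asin(x) - exact) < 1e-12

# Worst-case error of the corrected approximation over [-1, 1]
max_err = max(abs(asin_pade_3_4_half_angle_correction(i / 1000) - math.asin(i / 1000))
              for i in range(-1000, 1001))
```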
Now with the correction in place, the edges look more like this:

It might be a little hard to see, but the dashed lines are the ones with this half-angle-transform correction, and they are hugging the y=0 line. There is a tiny bit of error if you zoom in.
There's even a further optimization to be had: use (and adapt) a [1/2] Padé on the inside of the if body. This is because small_x will always be less than the square root of 0.075 (which is ~0.27). The [1/2] Padé approximant for asin() computes much faster, but is only accurate for smaller values of x. It can even be inlined into our function for further optimization. See below:
def asin_pade_1_2(x):
b1 = -1.0 / 6.0
d = 1.0 + (b1 * x**2)
return (x / d)
# ...
def asin_pade_3_4_half_angle_correction(x: float) -> float:
abs_x = abs(x)
# If past the range, then we can use the half angle transformation to account for error along with a "smaller" Pade
if abs_x > 0.85:
z = (1.0 - abs_x) / 2
b1 = -1.0 / 6.0
d = 1.0 + (b1 * z)
small_pade = math.sqrt(z) / d
r = (math.pi / 2) - (2.0 * small_pade)
return -r if x < 0 else r
# Within the border we can just use the 3/4 approximation like normal
return asin_pade_3_4(x)
It still looks the same as the above chart, so I don't think it's necessary to include another one. Written as C++, we have this:
constexpr double HalfPi = 1.5707963267948966;
inline double asin_pade_3_4(const double x)
{
constexpr double a1 = -367.0 / 714.0;
constexpr double b1 = -81.0 / 119.0;
constexpr double b2 = 183.0 / 4760.0;
const double x2 = x * x;
const double n = 1.0 + (a1 * x2);
const double d = 1.0 + (b1 * x2) + (b2 * x2 * x2);
return x * (n / d);
}
double asin_pade_3_4_half_angle_correction(const double x)
{
const double abs_x = std::abs(x);
if (abs_x <= 0.85)
{
return asin_pade_3_4(x);
}
else
{
// Edges of Pade curve
const double z = 0.5 * (1.0 - abs_x);
constexpr double b1 = -1.0 / 6.0;
const double d = 1.0 + (b1 * z);
const double pade_result = std::sqrt(z) / d;
const double r = HalfPi - (2.0 * pade_result);
return std::copysign(r, x);
}
}
Compared to the original approximation method, this is more complicated, but it has benefits: better accuracy near the edges of the domain, and no fallback branch to std::asin().
I'm very much a "put up or shut up" type of person. So let's actually plug it back into PSRayTracing and see if there is a speed improvement. We'll use the default scene (which is the final render from book 2):

That globe is the user of asin(). All of the renders generally look the same (minus a little fuzz). For the test case we will render a 1080p image with 250 samples per pixel, using a few cores. Testing was done on an M4 Mac Mini (running macOS Tahoe, using GCC 15 with -O3). Doing a few runs each and taking the median:
With std::asin() it took about 111 seconds to render the scene:
ben@Benjamins-Mac-mini build_gcc15 % ./PSRayTracing -j 4 -n 250 -s 1920x1080 -o render_std_asin.png
Scene: book2::final_scene
Render size: 1920x1080
Samples per pixel: 250
Max number of ray bounces: 50
Number of render threads: 4
Copy per thread: on
Saving to: render_std_asin.png
Seed: `ASDF`
Rendering: [==================================================] 100% 111s
Render took 110.891 seconds
The older asin() approximation took roughly 105 seconds (~5% speedup):
ben@Benjamins-Mac-mini build_gcc15 % ./PSRayTracing -j 4 -n 250 -s 1920x1080 -o render_asin_taylor.png
...
Render took 104.674 seconds
And this new Padé Approximant approach took... the same amount of time:
ben@Benjamins-Mac-mini build_gcc_15 % ./PSRayTracing -j 4 -n 250 -s 1920x1080 -o render_asin_pade.png
...
Render took 104.87 seconds
This new asin() method is better (in terms of correctness), but not much of a contest when it comes to performance. It's a small victory.
There's no denying that the AI tooling of the past 3+ years isn't going away. At CppCon 2023, Andrei Alexandrescu gave an excellent talk entitled "Robots Are After Your Job: Exploring Generative AI for C++"; it's worth your time to listen. With all of the above work done and that talk in mind, I decided to ask an LLM (Gemini in this case): "What's a fast approximation for asin(x) I can use in C++?". It gave me this:
constexpr double HalfPi = 3.1415926535897932385 / 2.0;
double fast_asin_cg(const double x)
{
// Original Minimax coefficients
constexpr double a0 = 1.5707288;
constexpr double a1 = -0.2121144;
constexpr double a2 = 0.0742610;
constexpr double a3 = -0.0187293;
// Strip sign
const double abs_x = fabs(x);
// Evaluate polynomial using Horner's method
double p = a3 * abs_x + a2;
p = p * abs_x + a1;
p = p * abs_x + a0;
// Apply sqrt term and pi/2 offset
const double x_diff = sqrt(1.0 - abs_x);
const double result = HalfPi - (x_diff * p);
// Restore sign
return copysign(result, x);
}
This looked too simple for me to believe. The source it cited was the documentation for asin in Nvidia's Cg Toolkit, a product that hasn't been updated since 2012. The reference implementation contained a rewritten form of the above. Porting it over to Python/Jupyter to verify it is trivial:
def asin_cg(x: float) -> float:
'''
Fast branchless asin(x) approximation.
Based on Abramowitz and Stegun formula 4.4.45
'''
# https://developer.download.nvidia.com/cg/asin.html
# https://personal.math.ubc.ca/~cbm/aands/page_81.htm
# Original Minimax coefficients from Abramowitz and Stegun
a0 = 1.5707288
a1 = -0.2121144
a2 = 0.0742610
a3 = -0.0187293
abs_x = abs(x)
# Evaluate polynomial using Horner's method
p = a3 * abs_x + a2
p = p * abs_x + a1
p = p * abs_x + a0
result = (math.pi / 2) - math.sqrt(1.0 - abs_x) * p
# Restore sign natively
return math.copysign(result, x)
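A quick sanity sweep of my own backs up the claim: Abramowitz and Stegun quote an error bound on the order of 5×10⁻⁵ for formula 4.4.45, and a brute-force check over the whole domain agrees that the worst case stays below 10⁻⁴ radians.

```python
import math

def asin_cg(x: float) -> float:
    # A&S 4.4.45 coefficients, evaluated with Horner's method
    a0, a1, a2, a3 = 1.5707288, -0.2121144, 0.0742610, -0.0187293
    abs_x = abs(x)
    p = ((a3 * abs_x + a2) * abs_x + a1) * abs_x + a0
    return math.copysign(math.pi / 2 - math.sqrt(1.0 - abs_x) * p, x)

# Worst-case absolute error over a dense grid on [-1, 1]
worst = max(abs(asin_cg(i / 5000) - math.asin(i / 5000)) for i in range(-5000, 5001))
```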
I was in disbelief that it was so clean and elegant: the implementation, the error, and the output. Look for yourself:

That curve overlaps the arcsin() function without any visible difference. And the error is practically nothing. Though the real test would be in the ray tracer itself:
ben@Benjamins-Mac-mini build_gcc_15_new_asin_cg % ./PSRayTracing -j 4 -n ...
Render took 101.462 seconds
Wow, this is considerably faster than any of the other methods. After comparing the render against std::asin()'s output, it's indistinguishable. A better asin() implementation was found.
This led me down a small rabbit hole of benchmarking this implementation on a few select chips and operating systems.
Intel i7-10750H, Ubuntu 24.04 (w/ GCC 14 and clang 19):
./test_gcc_O3 "ASDF" "100" "10000000"
std::asin() time: 29197.9 ms
asin_cg() time: 19839.8 ms
Verification sums:
std::asin(): -34549.5
asin_cg(): -34551.1
Difference: 1.60886
Error: -0.00465669 %
Speedup: 1.47169x faster

./test_clang_O3 "ASDF" "100" "10000000"
std::asin() time: 29520.7 ms
asin_cg() time: 19044.3 ms
...
Speedup: 1.55011x faster
Intel i7-10750H, Windows 11 (w/ MSVC 2022):
C:\Users\Benjamin\Projects\PSRayTracing\experiments\asin_cg_approx>test_msvc_O2.exe ASDF 100 10000000
std::asin() time: 12458.1 ms
asin_cg() time: 6562.1 ms
...
Speedup: 1.8985x faster
Apple M4, macOS Tahoe (w/ GCC 15 via Homebrew and clang 17):
./test_gcc_O3 "ASDF" "100" "10000000"
std::asin() time: 10469 ms
asin_cg() time: 10251 ms
...
Speedup: 1.02126x faster

./test_clang_O3 "ASDF" "100" "10000000"
std::asin() time: 12650 ms
asin_cg() time: 12073.2 ms
...
Speedup: 1.04777x faster
All of them have this Cg asin() approximation well in the lead. On the Intel chip it's faster by a very significant margin. I'm curious to test this on an AMD-based x86_64 system, but I'll leave that up to any readers; my guess is that it's just as good. The Apple M4 chip didn't get as much of a boost, but it's still measurable (and reproducible). Anything greater than a 2% change is notable; I refer to Nicholas Ormrod's old talk on this matter.
I think I originally went down the Taylor series rabbit hole because I started trying it out with sin() and cos(), then naturally assumed I could apply it to other trig functions. I never thought to first check whether someone had already solved my problem: a faster arcsine for computer graphics.
And here's the worst part: this all existed before LLMs were even available. I can't seem to recreate it, but there was a combination of the words "fast c++ asin approximation cg" that I queried into a search engine. The first result was a link to the Nvidia Cg Toolkit doc page. I only found this a few days ago.
I am surprised that no one else mentioned anything to me either. I even highlighted my faster asin() in the README as an achievement and no one bothered to correct me... I know this project (and these articles) have made the rounds in both C++ and computer graphics circles. People way more experienced and senior than me never said a thing.
This amazing snippet of code was languishing in the docs of dead software, which in turn the original formula was scrawled away in a math textbook from the 60s. It was also annoying that when I did try to search for this, no benchmarks were provided. Hopefully the word is out now.
I think my main problem is that I never bothered to slow down, double check what my goal was, and see if someone else already figured it out. That's what I gained from this experience.
And some fancy charts.
While I'm glad to see the OP got a good minimax solution at the end, it seems like the article missed clarifying one of the key points: error waveforms over a specified interval are critical, and if you don't see the characteristic minimax-like wiggle, you're wasting an easy opportunity for improvement.
Taylor series in general are a poor choice, and Pade approximants of Taylor series are equally poor. If you're going to use Pade approximants, they should be of the original function.
I prefer Chebyshev approximation: https://www.embeddedrelated.com/showarticle/152.php which is often close enough to the more complicated Remez algorithm.
Thanks so much for that!
I've been struggling to curve fit an aerodynamics equation relating Mach number to rocket nozzle exit/entrance area ratio for quite some time. It's a 5th or 6th degree polynomial whose inverse doesn't have a closed-form solution:
https://en.wikipedia.org/wiki/Abel%E2%80%93Ruffini_theorem
But I was able to use a Chebyshev fit that is within a few percent accurate at 3rd degree, and is effectively identical at 4th degree or higher. And 4th degree (quartic) polynomials do have a closed-form solution for their inverse. That lets me step up to higher abstractions for mass flow rate, power, etc without having to resort to tables. At most, I might need to use piecewise-smooth sections, which are far easier to work with since they can just be dropped into spreadsheets, used for derivatives/integrals, etc.
Anyway, I also discovered (ok AI mentioned) that the Chebyshev approximation is based on the discrete cosine transform (DCT):
https://en.wikipedia.org/wiki/Discrete_Chebyshev_transform#R...
https://en.wikipedia.org/wiki/Discrete_cosine_transform#Appl...
That's why it's particularly good at curve-fitting in just 2 or 3 terms. Which is why they use the DCT for image compression in JPG etc:
https://www.mathworks.com/help/images/discrete-cosine-transf...
The secret sauce is that Chebyshev approximation spreads the error as ripples across the function, rather than at its edges like with Taylor series approximation. That helps it fit more intricate curves and arbitrary data points, as well as mesh better with neighboring approximations.
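To make that DCT connection concrete, here's a minimal from-scratch sketch (my own illustration, stdlib only): sample f at the Chebyshev nodes, take a DCT-style cosine sum to get the series coefficients, and evaluate with Clenshaw's recurrence. The error of the truncated fit then ripples across the interval instead of piling up at the ends.

```python
import math

def cheb_coeffs(f, degree, nodes=64):
    """Chebyshev series coefficients of f on [-1, 1], via a DCT-style sum
    over samples taken at the Chebyshev nodes cos(pi*(j+0.5)/nodes)."""
    thetas = [math.pi * (j + 0.5) / nodes for j in range(nodes)]
    samples = [f(math.cos(t)) for t in thetas]
    cs = [2.0 / nodes * sum(s * math.cos(k * t) for s, t in zip(samples, thetas))
          for k in range(degree + 1)]
    cs[0] *= 0.5
    return cs

def cheb_eval(cs, x):
    """Evaluate sum(cs[k] * T_k(x)) with Clenshaw's recurrence."""
    b1 = b2 = 0.0
    for c in reversed(cs[1:]):
        b1, b2 = c + 2.0 * x * b1 - b2, b1
    return cs[0] + x * b1 - b2

# Degree-8 fit of exp on [-1, 1]: tiny max error, and the error curve
# oscillates (changes sign many times) across the whole interval
cs = cheb_eval_input = cheb_coeffs(math.exp, 8)
errs = [math.exp(i / 100) - cheb_eval(cs, i / 100) for i in range(-100, 101)]
max_abs_err = max(abs(e) for e in errs)
sign_changes = sum(1 for a, b in zip(errs, errs[1:]) if a * b < 0)
```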
I had no idea, but this "wiggle" is required for an optimal approximation, it's called the "equioscillation property" [https://en.wikipedia.org/wiki/Equioscillation_theorem].
For a polynomial P (of degree n) to approximate a function F on the real numbers with minimal absolute error, the max error value of |P - F| needs to be hit multiple times, (n+2 times to be precise). You need to have the polynomial "wiggle" back and forth between the top of the error bound and the bottom.
And even more surprisingly, this is a necessary _and sufficient_! condition for optimality. If you find a polynomial whose error alternates and it hits its max error bound n+2 times, you know that no other polynomial of degree n can do better, that is the best error bound you can get for degree n.
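A tiny concrete example of that n+2 rule (my own, using the known closed form): the best degree-1 approximation of e^x on [0, 1] has slope e-1, and its error hits the same maximum magnitude, with alternating sign, at exactly 3 points: x = 0, x = ln(e-1), and x = 1.

```python
import math

a = math.e - 1.0                      # optimal slope = the secant slope
x_star = math.log(a)                  # interior extremum, where d/dx e^x = a
b = (1.0 + a * (1.0 - x_star)) / 2.0  # intercept chosen so the extrema balance
E = 1.0 - b                           # the minimax error

def err(x):
    return math.exp(x) - (a * x + b)

# Equioscillation: +E, -E, +E at the three alternation points,
# and no grid point exceeds E in magnitude
grid_max = max(abs(err(i / 1000)) for i in range(1001))
```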
Very cool!
Chebyshev polynomials cos(n arccos(x)) provide one of the proofs that every continuous function f:[0,1]->R can be uniformly approximated by polynomial functions. Bernstein polynomials provide a shorter proof, but perhaps not the best numerical method: https://en.wikipedia.org/wiki/Bernstein_polynomial#See_also
Those don't guarantee that they can be well approximated by a polynomial of degree N though, like we have here. You can apply Jackson's inequality to calculate a maximum error bound, but the epsilon for degree 5 is pretty atrocious.
To be accurate, this is originally from Hastings 1955 (Princeton), "Approximations for Digital Computers" by Cecil Hastings, pages 159-163; there are actually multiple versions of the approximation with different constants. So the original work was done with the goal of being performant on computers of the 1950s. Then the famous Abramowitz and Stegun put it in formula 4.4.45 with permission, and the Nvidia Cg library wrote code based upon that formula, likely with some optimizations.
I ran this down because I have a particular interest in vectorizable function approximations. Particularly those that exploit bit-banging to handle range normalization. (Anyone have a good reference for that?)
Regrettably, this is NOT from Hastings 1955. Hastings provides Taylor series and Chebyshev polynomial approximations. The OP's solution is a Pade approximation, which are not covered at all in Hastings.
When you say "this is NOT from Hastings" I had to double-check my post -- I think you mean the Pade approximation is not from Hastings, which appears to be true. But the polynomial approximation that the OP referenced from Nvidia, via A&S, definitely is in Hastings on page 159. It is interesting that in the article the OP tried a Taylor expansion and a Pade approximation, but not the fairly standard "welp, let's just fit an Nth-order polynomial to the arcsin", which is what Hastings did back in the day.
>Particular those that exploit bit-banging to handle range normalization
https://userpages.cs.umbc.edu/phatak/645/supl/Ng-ArgReductio...
That's tiny but weird.
This line:
> This amazing snippet of code was languishing in the docs of dead software, which in turn the original formula was scrawled away in a math textbook from the 60s.
was kind of telling for me. I have some background in this sort of work (and long ago concluded that there was pretty much nothing you can do to improve on existing code, unless either you have some new specific hardware or domain constraint, or you're just looking for something quick-n-dirty for whatever reason, or are willing to invest research-paper levels of time and effort) and to think that someone would call Abramowitz and Stegun "a math textbook from the 60s" is kind of funny. It's got a similar level of importance to its field as Knuth's Art of Computer Programming or stuff like that. It's not an obscure text. Yeah, you might forget what all is in it if you don't use it often, but you'd go "oh, of course that would be in there, wouldn't it...."
One of the ways that the classics can be improved is not to take the analytic ideal coefficients and approximate them to the closest floating point number, but rather take those ideal coefficients as a starting point for a search of slightly better ones.
The SLEEF Vectorized Math Library [1] does this and therefore can usually provide accuracy guarantees for the whole floating point range with a polynomial order lower than theory would predict.
Its asinf function [2] is accurate to 1 ULP for all single-precision floats, and is similar to the asin_cg from the article, with the main difference that the sqrt is done on the input of the polynomial instead of the output.
[1] https://sleef.org/ [2] https://github.com/shibatch/sleef/blob/master/src/libm/sleef...
I'm sorry, that second reference was actually for the 3.5ULP variant. The 1 ULP is here: https://github.com/shibatch/sleef/blob/master/src/libm/sleef...
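In that spirit, here's a toy version of the coefficient-refinement idea (my own sketch, not SLEEF's actual tooling): start from the A&S coefficients used in the article's asin_cg and do a small random search for a coefficient set with lower worst-case error on a grid. By construction the search can only keep or improve on the published coefficients.

```python
import math
import random

COEFFS = [1.5707288, -0.2121144, 0.0742610, -0.0187293]  # A&S 4.4.45 starting point

def asin_poly(x, a):
    ax = abs(x)
    p = ((a[3] * ax + a[2]) * ax + a[1]) * ax + a[0]
    return math.copysign(math.pi / 2 - math.sqrt(1.0 - ax) * p, x)

GRID = [i / 2048 for i in range(2049)]  # the approximation is odd, so [0, 1] suffices

def max_err(a):
    return max(abs(asin_poly(x, a) - math.asin(x)) for x in GRID)

random.seed(1234)
best, best_err = list(COEFFS), max_err(COEFFS)
for _ in range(300):
    # small relative perturbations around the current best coefficients
    trial = [c * (1.0 + random.uniform(-1e-4, 1e-4)) for c in best]
    e = max_err(trial)
    if e < best_err:
        best, best_err = trial, e
```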
Abramowitz/Stegun was updated in 2010 and now resides here: https://dlmf.nist.gov/
Doesn't seem to be terribly up to date though. It seems to use almost exclusively taylor series, and seems to be completely uninterested in error analysis of any kind. Unless I'm missing something.
It's a general-purpose reference for mathematicians, not specifically for numerical analysis. Mathematicians are usually interested in the boring old power series centered at zero (Maclaurin series), so that's what gets prominence.
These are books that my uni courses never had me read. I'm a little shocked at times at how my degree program skimped on some of the more famous texts.
It is not a textbook, it is an extremely dense reference manual, so that honestly makes sense.
In physics grad school, professors would occasionally allude to it, and textbooks would cite it ... pretty often. So it's a thing anyone with postgraduate physics education should know exists, but you wouldn't ever be assigned it.
Presumably someone read it though, at some point, in order to be able to cite it.
The relevant sections, at any rate
I didn't need Abramowitz and Stegun until grad school. In the 1990s. It was a well-known reference book for people at that level, not a text book.
For my undergrad the CRC math handbook was enough.
In this case, AI understood history better than the human.
Yeah, if you want something that's somewhat obscure, pull up Cody and Waite "Software Manual for the Elementary Functions".
And, lo and behold, the ASIN implementation is minimax.