Faster Argmin on Floats

2025-09-18 16:20 · algorithmiker.github.io

Consider the following problem: you are given a large array of N floating point numbers, and you are asked to find the index of the smallest one (commonly called argmin) as fast as possible. But there’s a catch: you know that all values in the list are positive or +0, and none are infinite or NaN.

First solution

The first solution that comes to mind for this, in Rust, is:

let argmin = data.iter().enumerate().min_by(|a, b| a.1.total_cmp(b.1));

For a million numbers, this runs in around 511 µs. Not bad. However, we can do better by using what we know about the data itself.
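As a side note, `total_cmp` works on arbitrary floats because it implements the IEEE-754 totalOrder predicate, which also orders NaNs and signed zeros. A minimal sketch of what that ordering looks like:

```rust
// total_cmp gives a total order over all f32 values, including NaN
// and signed zeros, which is why min_by can use it unconditionally.
use std::cmp::Ordering;

fn main() {
    assert_eq!(1.0f32.total_cmp(&2.0), Ordering::Less);
    // Negative zero sorts before positive zero under totalOrder.
    assert_eq!((-0.0f32).total_cmp(&0.0), Ordering::Less);
    // A (positive, quiet) NaN sorts after every finite value and +inf.
    assert_eq!(f32::NAN.total_cmp(&f32::INFINITY), Ordering::Greater);
}
```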

Second solution

When optimizing this, my first instinct was to implement our own comparator using the natural partial order of floats. This will of course return wrong results for inputs that violate our preconditions (NaNs in particular).

let argmin = data.iter().enumerate()
 .reduce(|a, b| if a.1 < b.1 { a } else { b });

The runtime here is 489 µs for a million numbers.
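To see why this reduce is only valid under the article's preconditions, here is a sketch showing how a single NaN silently corrupts the result (the helper name `argmin_partial` is mine, for illustration):

```rust
// Argmin via the partial order `<`. Correct only when the data
// contains no NaNs, per the problem's preconditions.
fn argmin_partial(data: &[f32]) -> Option<usize> {
    data.iter()
        .enumerate()
        .reduce(|a, b| if a.1 < b.1 { a } else { b })
        .map(|(i, _)| i)
}

fn main() {
    // Well-behaved input: returns the index of the smallest value.
    assert_eq!(argmin_partial(&[2.0, 0.5, 1.0]), Some(1));
    // Any comparison with NaN is false, so the reduce keeps `b`
    // and the running "minimum" drifts to a wrong index.
    assert_eq!(argmin_partial(&[0.5, f32::NAN, 1.0]), Some(2));
}
```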

Third solution

A different, easy solution is to use the partial comparison function from the Rust standard library and return Equal if the two values cannot be compared (which should never happen in our case).

let argmin = data.iter().enumerate()
 .min_by(|a, b| a.1.partial_cmp(b.1).unwrap_or(Ordering::Equal));

This turns out to be slightly faster than our second solution; for our million-number benchmark, it runs in 470 µs.

I suspect this is because the compiler can better optimize the code.

Fourth solution

We are also told that the list contains only positive numbers. This lets us use a very elegant property of the IEEE-754 floating point representation: for non-negative floats, the numeric order matches the order of their bit patterns interpreted as unsigned integers, so you can compare f32 values as u32. This is also why you can radix-sort floats.

let argmin = data.iter().enumerate().min_by_key(|a| (a.1.to_bits()));

For positive numbers, this is blazingly fast: it takes only 370 µs, roughly a 30% speedup over the baseline.

For more information, I recommend this Stack Overflow answer: https://stackoverflow.com/a/59349481.
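The bit-pattern property the fourth solution relies on can be checked directly. A minimal sketch (the helper name `argmin_bits` is mine, not from the article):

```rust
// For non-negative, finite f32 values, the IEEE-754 bit pattern
// (interpreted as u32) preserves numeric order, so comparing bits
// compares the floats.
fn argmin_bits(data: &[f32]) -> Option<usize> {
    data.iter()
        .enumerate()
        .min_by_key(|(_, x)| x.to_bits())
        .map(|(i, _)| i)
}

fn main() {
    // Bit patterns of positive floats sort like the floats themselves.
    assert!(0.25f32.to_bits() < 0.5f32.to_bits());
    assert!(0.5f32.to_bits() < 3.5f32.to_bits());

    let data = [3.5f32, 0.25, 7.0, 0.5];
    assert_eq!(argmin_bits(&data), Some(1));
}
```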

Benchmark program

use criterion::{Criterion, criterion_group, criterion_main};
use std::{cmp::Ordering, hint::black_box};

fn generate_data(n: usize) -> Vec<f32> {
    let half = n / 2;
    // Descending then ascending run of odd multiples of 0.5, so the
    // minimum (0.5) sits in the middle of the array (appearing twice).
    // This yields n + 2 values for even n; size the capacity to match.
    let mut vec = Vec::with_capacity(n + 2);
    vec.extend(
        (0..=half)
            .rev()
            .chain(0..=half)
            .map(|x| (2 * x + 1) as f32 / 2.0),
    );
    vec
}
const N: usize = 1_000_000;

fn bench_normal_min(c: &mut Criterion) {
    let data = generate_data(N);
    c.bench_function("normal_min", |b| {
        b.iter(|| {
            let argmin = data.iter().enumerate().min_by(|a, b| a.1.total_cmp(b.1));
            black_box(argmin);
        })
    });
}

fn bench_reduce_min(c: &mut Criterion) {
    let data = generate_data(N);
    c.bench_function("reduce_min", |b| {
        b.iter(|| {
            let min = data
                .iter()
                .enumerate()
                .reduce(|a, b| if a.1 < b.1 { a } else { b });
            black_box(min);
        })
    });
}

fn bench_partial_min(c: &mut Criterion) {
    let data = generate_data(N);
    c.bench_function("partial_unwrap_min", |b| {
        b.iter(|| {
            let argmin = data
                .iter()
                .enumerate()
                .min_by(|a, b| a.1.partial_cmp(b.1).unwrap_or(Ordering::Equal));
            black_box(argmin);
        })
    });
}

fn bench_u32_min_positive(c: &mut Criterion) {
    let data = generate_data(N);
    c.bench_function("u32_min_positive", |b| {
        b.iter(|| {
            let min = data.iter().enumerate().min_by_key(|a| a.1.to_bits());
            black_box(min);
        })
    });
}

criterion_group!(
    benches,
    bench_normal_min,
    bench_reduce_min,
    bench_partial_min,
    bench_u32_min_positive,
);
criterion_main!(benches);


Comments

  • By why_only_15, 2025-09-20 4:25

    This trick is very useful on Nvidia GPUs for calculating mins and maxes in some cases, e.g. atomic mins (better u32 support than f32) or warp-wide mins with `redux.sync` (only supports u32, not f32).

  • By teo_zero, 2025-09-20 5:51 (1 reply)

    I had expected something about algorithms, not Rust-specific implementations.

    • By why_only_15, 2025-09-20 6:30

      doing a u32 compare instead of an f32 compare is not rust-specific or indeed CPU-specific.

  • By TheDudeMan, 2025-09-20 4:36 (1 reply)

    How fast if you write a for loop and keep track of the index and value of the smallest (possibly treating them as ints)?

    • By nine_k, 2025-09-20 5:04 (4 replies)

      I hazard to guess that it would be the same, because the compiler would produce a loop out of .iter(), would expose the loop index via .enumerate(), and would keep track of that index in .min_by(). I suppose the lambda would be inlined, maybe even along with comparisons.

      I wonder whether that could be made faster by using AVX instructions; they allow finding the minimum value among several u32 values, but not immediately its index.

      • By shoo, 2025-09-20 8:00

        Even without AVX it seems possible to do better than a naive C style for loop argmax by manually unrolling the loop a bit and maintaining multiple accumulators

        e.g. using 4 accumulators instead of 1 accumulator in the naive for loop gives me around a 15%-20% speedup (Not using rust, extremely scalar terrible naive C code via g++ with -funroll-all-loops -march=native -O3)

        if we're expressing argmax via the obvious C style naive for loop, or a functional reduce, with a single accumulator, we're forcing a chain dependency that isn't really part of the problem. but if we don't care which argmax-ing index we get (if there are multiple minimal elements in the array) then instead of evaluating the reductions in a single rigid chain bound by a single accumulator, we can break the chain and get our hardware to do more work in parallel, even if we're only single threaded.

        anonymoushn is doing something much cleverer again using intrinsics but there's still that idea of "how do we break the dependency chain between different operations so the cpu can kick them off in parallel"
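The multiple-accumulator idea from this comment can be sketched in safe Rust. The helper name `argmin_unrolled` and the lane count of 4 are illustrative, not from the thread; it combines the article's bit-comparison trick with four independent running minima, and, as the comment notes, it may return a different (still minimal) index than the sequential version when the minimum is not unique:

```rust
// Argmin with 4 independent accumulators so the CPU can overlap the
// comparison chains. Assumes positive, finite f32 values so the u32
// bit-pattern comparison is order-preserving.
fn argmin_unrolled(data: &[f32]) -> Option<usize> {
    if data.is_empty() {
        return None;
    }
    // (index, bits) per lane; usize::MAX marks an unused lane.
    let mut acc = [(usize::MAX, u32::MAX); 4];
    for (base, chunk) in data.chunks(4).enumerate() {
        for (lane, x) in chunk.iter().enumerate() {
            let bits = x.to_bits();
            if bits < acc[lane].1 {
                acc[lane] = (base * 4 + lane, bits);
            }
        }
    }
    // Combine the four lane minima.
    acc.iter()
        .filter(|&&(i, _)| i != usize::MAX)
        .min_by_key(|&&(_, b)| b)
        .map(|&(i, _)| i)
}

fn main() {
    assert_eq!(argmin_unrolled(&[3.0, 1.0, 2.0, 0.5, 4.0]), Some(3));
    assert_eq!(argmin_unrolled(&[7.0]), Some(0));
    assert_eq!(argmin_unrolled(&[]), None);
}
```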

      • By anonymoushn, 2025-09-20 7:05

        you can have some vector registers n_acc, ns, idx_acc, idxs, then you can do

          // (initialize ns and idxs by reading from the array
          //  and adding the appropriate constant to the old value of idxs.)
          n_acc = min(n_acc, ns);
          const is_new_min = eq(n_acc, ns);
          idx_acc = blend(idx_acc, idxs, is_new_min);
        
        Edit: I wrote this with min, eq, blend but you can actually use cmpgt, min, blend to avoid having a dependency chain through all three instructions. I am just used to using min, eq, blend because of working on unsigned values that don't have cmpgt

        you can consult the list of toys here: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

      • By TinkersW, 2025-09-20 10:10 (1 reply)

        Yes this is fairly easy to write in AVX, and you can track the index also, honestly the code is cleaner and nicer to read than this mildly obfuscated rust.

        • By imtringued, 2025-09-20 10:25

          You're referring to nothing and nothing. What exactly are you talking about? It certainly can't be the trivial to understand one liners in the blog.

      • By TheDudeMan, 2025-09-20 22:32

        But how is that slower than sorting the list?!

HackerNews