Show HN: Single-Header Profiler for C++17

2025-04-14 12:16 · github.com

Collection of self-contained header-only libraries for C++17 - DmitriBogdanov/UTL

utl::profiler is a single-include solution for localized profiling. It features simple macros to measure how much time is taken by a certain scope / expression / code segment. The profiler automatically builds a call graph for all profiled functions and prints a nicely formatted table for every thread. See examples.
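
A minimal quick-start sketch (hypothetical usage, assuming the single header is on the include path):

#include "UTL/profiler.hpp"

int main() {
    // Label is a string literal that will show up in the results table
    UTL_PROFILER("Main loop")
    for (int i = 0; i < 1000; ++i) {
        // ... work to be measured ...
    }
} // per-thread results table prints automatically on exit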

Below is an output example from profiling a JSON parser (image not reproduced here).

// Profiling macros
UTL_PROFILER_SCOPE(label);
UTL_PROFILER(label);
UTL_PROFILER_BEGIN(segment, label);
UTL_PROFILER_END(segment);

// Style options
struct Style {
    std::size_t indent = 2;
    bool        color  = true;

    double cutoff_red    = 0.40; // > 40% of total runtime
    double cutoff_yellow = 0.20; // > 20% of total runtime
    double cutoff_gray   = 0.01; // <  1% of total runtime
};

// Global profiler object
struct Profiler {
    void print_at_exit(bool value) noexcept;

    void upload_this_thread();

    std::string format_results(const Style& style = Style{});
};

inline Profiler profiler;
UTL_PROFILER_SCOPE(label);

Attaches profiler to the current scope.

If the profiled scope was entered at any point of the program, a per-thread call graph will be built for all profiled segments upon exiting main().

Note 1: label is a string literal name that will be shown in the results table.

Note 2: Automatic printing on exit can be disabled.
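
For example, a minimal sketch attaching the profiler to a function body (process_file is a hypothetical function):

void process_file() {
    UTL_PROFILER_SCOPE("Processing"); // measures from here to the end of the scope
    // ... code to be measured ...
}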

UTL_PROFILER(label);

Attaches profiler to the scope of the following expression.

Convenient for profiling individual loops, function calls, if-branches and the like.
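
A one-line sketch (parse_json is a hypothetical function):

UTL_PROFILER("Parsing") parse_json(); // only the following statement is measured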

UTL_PROFILER_BEGIN(segment, label);
UTL_PROFILER_END(segment);

Attaches profiler to the code section between two BEGIN/END macros with the same segment label.
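
A short sketch with hypothetical labels and helper functions, profiling a section that does not form its own scope:

UTL_PROFILER_BEGIN(io, "File I/O");
write_results();   // hypothetical call
read_back_cache(); // hypothetical call
UTL_PROFILER_END(io);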

struct Style {
    std::size_t indent = 2;
    bool        color  = true;

    double cutoff_red    = 0.40; // > 40% of total runtime
    double cutoff_yellow = 0.20; // > 20% of total runtime
    double cutoff_gray   = 0.01; // <  1% of total runtime
};

A struct with formatting settings for Profiler::format_results().
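
For instance, a sketch of tweaking the highlight thresholds (the values here are arbitrary):

profiler::Style style;
style.cutoff_red    = 0.50; // highlight segments taking > 50% of total runtime
style.cutoff_yellow = 0.25; // highlight segments taking > 25% of total runtime

std::cout << profiler::profiler.format_results(style);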

void Profiler::print_at_exit(bool value) noexcept;

Sets whether profiling results should be automatically printed after exiting from main(). true by default.

Note: This and all other profiler object methods are thread-safe.

void Profiler::upload_this_thread();

Uploads profiling results from the current thread to the profiler object.

Can be used to upload results from detached threads. Otherwise, results are uploaded automatically once the thread joins.

std::string Profiler::format_results(const Style& style = Style{});

Formats profiling results to a string using given style options.

inline Profiler profiler;

Global profiler object.

Note

Online compiler explorers may be a little unreliable when it comes to sleep & time measurement precision.

[ Run this code ]

using namespace std::chrono_literals;

void computation_1() { std::this_thread::sleep_for(300ms); }
void computation_2() { std::this_thread::sleep_for(200ms); }
void computation_3() { std::this_thread::sleep_for(400ms); }
void computation_4() { std::this_thread::sleep_for(600ms); }
void computation_5() { std::this_thread::sleep_for(100ms); }

// ...

// Profile a scope
UTL_PROFILER_SCOPE("Computation 1 - 5");
computation_1();
computation_2();

// Profile an expression
UTL_PROFILER("Computation 3") computation_3();

// Profile a code segment
UTL_PROFILER_BEGIN(comp_45, "Computation 4 - 5");
computation_4();
computation_5();
UTL_PROFILER_END(comp_45);

Output:

[ Run this code ]

void recursive(int depth = 0) {
    if (depth > 4) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        return;
    }

    UTL_PROFILER("1st recursion branch") recursive(depth + 1);
    UTL_PROFILER("2nd recursion branch") recursive(depth + 2);
}

// ...

recursive();

Output:

Note

In this example we will use utl::parallel to represent a parallel section concisely.

[ Run this code ]

using namespace utl;
using namespace std::chrono_literals;

// Run loop on the main thread
UTL_PROFILER("Single-threaded loop")
for (int i = 0; i < 30; ++i) std::this_thread::sleep_for(10ms);

// Run the same loop on 3 threads
parallel::set_thread_count(3);

UTL_PROFILER("Multi-threaded loop")
parallel::for_loop(parallel::IndexRange{0, 30}, [](int low, int high){
    UTL_PROFILER("Worker thread loop")
    for (int i = low; i < high; ++i) std::this_thread::sleep_for(10ms);
});

parallel::set_thread_count(0);

Output:

Note

In this example we will use utl::parallel to represent a detached section concisely.

[ Run this code ]

using namespace utl;
using namespace std::chrono_literals;

parallel::set_thread_count(2);

// Detached task
UTL_PROFILER("Uploading task 1")
parallel::task([]{
    UTL_PROFILER("Detached task 1: part 1") std::this_thread::sleep_for(700ms);
});

// Detached task with explicit result upload
UTL_PROFILER("Uploading task 2")
parallel::task([]{
    UTL_PROFILER("Detached task 2: part 1") std::this_thread::sleep_for(50ms);
    UTL_PROFILER("Detached task 2: part 2") std::this_thread::sleep_for(50ms);

    // Manually upload results to the main thread,
    // otherwise results get collected once the thread joins
    profiler::profiler.upload_this_thread();

    UTL_PROFILER("Detached task 2: part 3") std::this_thread::sleep_for(500ms);
});

// Wait a little so the 2nd task has time to reach manual upload
UTL_PROFILER("Waiting for task 2 to be partially done")
std::this_thread::sleep_for(200ms);

// Format results explicitly
profiler::profiler.print_at_exit(false);

std::cout << profiler::profiler.format_results();

Output:

[ Run this code ]

using namespace utl;
using namespace std::chrono_literals;

// Profile something
UTL_PROFILER("Loop")
for (int i = 0; i < 10; ++i) {
    UTL_PROFILER("1st half of the loop") std::this_thread::sleep_for(10ms);
    UTL_PROFILER("2nd half of the loop") std::this_thread::sleep_for(10ms);
}

// Disable automatic printing
profiler::profiler.print_at_exit(false);

// Disable colors, remove indent, format to string
profiler::Style style;
style.color  = false;
style.indent = 0;

const std::string results = profiler::profiler.format_results(style);

// Export to file & console
std::ofstream("profiling_results.txt") << results;
std::cout                              << results;

Output:

-------------------- UTL PROFILING RESULTS ---------------------

# Thread [main] (reuse 0) (running) (runtime -> 201.81 ms)
 - 99.99%  | 201.79 ms |                 Loop | example.cpp:8, main()  |
 - 49.91%  | 100.73 ms | 1st half of the loop | example.cpp:10, main() |
 - 50.07%  | 101.04 ms | 2nd half of the loop | example.cpp:11, main() |

By far the most significant part of profiling overhead comes from calls to std::chrono::steady_clock::now().

It is possible to significantly reduce that overhead by using CPU-counter intrinsics. To do so, simply define the UTL_PROFILER_USE_INTRINSICS_FOR_FREQUENCY macro to the needed frequency before the include:

#define UTL_PROFILER_USE_INTRINSICS_FOR_FREQUENCY 3.3e9 // 3.3 GHz (AMD Ryzen 5 5600H)
#include "UTL/profiler.hpp" // will now use 'rdtsc' for timestamps

This is exceedingly helpful when profiling code on a hot path. Below are a few benchmarks showcasing the difference on one particular machine:

======= USING std::chrono ========

| relative |               ms/op |                op/s |    err% |     total | benchmark
|---------:|--------------------:|--------------------:|--------:|----------:|:----------
|   100.0% |                3.46 |              289.22 |    0.1% |      0.44 | `Runtime without profiling`
|    53.9% |                6.41 |              155.90 |    0.3% |      0.77 | `Theoretical best std::chrono profiler`
|    52.2% |                6.62 |              151.07 |    0.2% |      0.80 | `UTL_PROFILER()`

// very light workload - just 8 computations of 'std::cos()' per 2 time measurements, difficult to
// time and sensitive to overhead; here the profiled code is ~2x slower than the non-profiled workload

====== USING __rdtsc() ======

| relative |               ms/op |                op/s |    err% |     total | benchmark
|---------:|--------------------:|--------------------:|--------:|----------:|:----------
|   100.0% |                3.50 |              286.11 |    0.6% |      0.43 | `Runtime without profiling`
|    86.3% |                4.05 |              247.01 |    0.2% |      0.49 | `Theoretical best __rdtsc() profiler`
|    73.7% |                4.74 |              210.97 |    0.3% |      0.57 | `UTL_PROFILER()`

// notable reduction in profiling overhead

Note

Here "theoretical best" refers to a hypothetical profiler that requires zero operations aside from measuring the time at two points — before and after entering the code segment.

To disable any profiling code from interfering with the program, simply define UTL_PROFILER_DISABLE before including the header:

#define UTL_PROFILER_DISABLE
#include "UTL/profiler.hpp"
// - the header is now stripped of any and all code and only provides no-op mocks of the public API,
//   which means effectively no impact on compile times
// - 'profiler.format_results()' now returns "<profiling is disabled>"

A simple & naive way to construct a call graph would be to build a tree of nodes using std::unordered_map<std::string, Node> with the call-site as a key. Such an approach, however, makes the overhead of tree expansion & traversal incredibly high, rendering the profiler useless for small tasks.

This library uses a bunch of thread_local variables (created by macros) to correlate call-sites with integer IDs, and reduces tree traversal logic to traversing a "network" of indices encoded as a dense $M \times N$ matrix, where $M$ is the number of call-sites visited by this thread and $N$ is the number of nodes in the call graph.

There are some additional details & arrays, but the bottom line is that by associating everything we can with linearly growing IDs, and by delaying "heavy" operations as much as possible until thread destruction / formatting, we can reduce almost all common operations outside of time measurement to trivial integer array lookups.

This way, the cost of re-entry on existing call graph nodes (aka the fast path taken most of the time) is reduced down to a single array lookup & branch that gets predicted most of the time.

New call-site entry & new node creation are rare slow paths: they only happen during call-graph expansion and contribute very little to the runtime outside of measuring very deep recursion. By using an std::vector-like allocation strategy for both rows & columns it is possible to make reallocation amortized $O(1)$.
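
As a rough illustration, here is a sketch of this traversal scheme under the assumptions above (all names and details are hypothetical, not the library's actual internals):

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of the dense call-graph "network" described above
struct CallGraph {
    static constexpr std::size_t empty = static_cast<std::size_t>(-1); // sentinel

    std::size_t              rows;             // M - call-sites seen by this thread
    std::size_t              cols = 1;         // N - nodes, starts with a root node
    std::size_t              current_node = 0; // root
    std::vector<std::size_t> prev;             // parent of each node, size N
    std::vector<std::size_t> next;             // M x N matrix, flattened row-major

    explicit CallGraph(std::size_t callsite_count)
        : rows(callsite_count), prev(1, empty), next(callsite_count, empty) {}

    // Fast path: re-entry into an existing node is one lookup & branch
    void traverse_forward(std::size_t callsite_id) {
        const std::size_t next_node = next[callsite_id * cols + current_node];
        if (next_node == empty) create_node(callsite_id); // rare slow path
        else                    current_node = next_node; // usually predicted
    }

    // Backwards traversal is a single array lookup
    void traverse_backward() { current_node = prev[current_node]; }

    // Slow path: add a node column (naive growth for illustration,
    // the real thing would amortize reallocation like std::vector)
    void create_node(std::size_t callsite_id) {
        const std::size_t new_node = cols;
        std::vector<std::size_t> grown(rows * (cols + 1), empty);
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                grown[r * (cols + 1) + c] = next[r * cols + c];
        next = std::move(grown);
        next[callsite_id * (cols + 1) + current_node] = new_node;
        prev.push_back(current_node);
        ++cols;
        current_node = new_node;
    }
};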

Memory overhead of profiling is mostly defined by the aforementioned call graph matrix. For example, for a thread that runs into 20 profiling macros and creates 100 nodes, the memory overhead is going to be about 8 kB. For a thread that runs into 100 profiling macros and creates 500 call graph nodes, the memory overhead will be about 0.2 MB.
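
Assuming the default 32-bit IDs, these estimates follow directly from the matrix size: $20 \times 100 \times 4 \text{ B} = 8$ kB and $100 \times 500 \times 4 \text{ B} = 0.2$ MB, plus comparatively small per-node arrays.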

It is possible to further reduce memory overhead (down to 4 kB and 0.1 MB in the examples above) by defining the UTL_PROFILER_USE_SMALL_IDS macro before the include:

#define UTL_PROFILER_USE_SMALL_IDS
#include "UTL/profiler.hpp"

This switches the implementation to 16-bit IDs, which limits the max number of nodes to 65535. For most practical purposes this should be more than enough, as most machines will hit stack overflow far before the call graph reaches such a size.

Almost all profiling is lock-free; there are only 3 points at which the implementation needs to lock a mutex:

  • When creating a new thread
  • When joining a thread
  • When manually calling profiler.upload_this_thread()

All public API is thread-safe.


Comments

  • By dustbunny 2025-04-15 14:35

    As a gamedev, I almost never need the total time spent in a function, rather I need to visualize the total time spent in a function for that frame. And then I scan the output for long frames and examine those hotspots one frame at a time. Would be nice to be able to use that workflow in this, but visualizing it would be much different.

  • By gurkwart 2025-04-15 11:51 (2 replies)

    Nice, I like the colored output tables. Started tinkering with a small profiling lib as well a while ago.

    https://github.com/gurki/glimmer

    It focuses on creating flamegraphs to view on e.g. https://www.speedscope.app/. I wanted to use std::stacktrace, but they are very costly to evaluate, even just lazily at exit. Eventually, I just tracked thread and call layer manually.

    If I understand correctly, you're tracking your call stack manually as well using some graph structure on linear ids? Mind elaborating a bit on its functionality and performance? Also proper platform-independent function names were a pita. Any comments on how you addressed that?

    • By dustbunny 2025-04-15 14:19

      Speed scope is awesome.

      I've been thinking about using speedscope as a reference to make a native viewer like that.

      Sampling profilers (like perf) are just so much easier to use than source markup ones. Just feel like the tooling around perf is bad and that speedscope is part of the solution.

    • By GeorgeHaldane 2025-04-15 22:48

      General rundown of the logic can be found in this comment on reddit: https://www.reddit.com/r/cpp/comments/1jy6ver/comment/mmze20...

      About linear IDs: A call graph in the general case is a tree of nodes; each node has a single parent and an arbitrary number of children. Each node accumulates time spent in the "lower" branches. A neat property of the callgraph relative to a generic tree is that every node can be associated with a callsite. For example, if some function f() calls itself recursively 3 times, there will be multiple nodes corresponding to it, but in terms of callsites there is still only one. So let's take a simple call graph as an example:

        Callgraph:         f() -> f() -> f() -> g()
                                             -> h()
        Node id:           0      1      2      3,4
      
      Let's say f() has callsite id '0', g() has callsite id '1', h() has callsite id '2'. The callgraph will then consist of N=5 nodes with M=3 different callsites:

        Node id:         { 0   1   2   3   4 }
        Callsite id:     { 0   0   0   1   2 }
      
      We can then encode all "prev" nodes as a single N-vector, and all "next" nodes as an MxN matrix, which has some kind of sentinel value (like -1) in places with no connection. For this example this results in the following:

        Node id:         { 0   1   2   3   4 }
        Prev. id:        { x   0   1   2   2 }
        Next id matrix:  [ 1   2   x   x   x ]
                         [ x   x   3   x   x ]
                         [ x   x   4   x   x ]
      
      Every thread has a thread-local callgraph object that keeps track of all this graph traversal; it holds 'current_node_id'. Traversing backwards on the graph is a single array lookup:

        current_node_id = prev_node_ids[current_node_id];
      
      Traversing forwards to an existing callgraph node is a lookup & branch:

        next_node_id = next_node_ids[callsite_id, current_node_id];
        if (next_node_id == x) create_node(callsite_id);       // rare slow path
        else                   current_node_id = next_node_id; // usually predicted branch
      
      New nodes can be created pretty cheaply too, but the details are too verbose for a comment. The key to tracking the callsites and assigning them IDs is the thread_local variables generated by the macro:

      https://github.com/DmitriBogdanov/UTL/blob/master/include/UT...

      When a callsite marker initializes (which only happens once), it gets a new ID. The timer then gets this 'callsite_id' and passes it to the forwards-traversal. The way we get function names is by simply remembering the __FILE__, __func__, __LINE__ pointers in another array of the call graph; they get saved during callsite marker initialization too. As far as performance goes, everything we do is cheap & simple operations; at this point the main overhead is just from taking the timestamps.

  • By bogwog 2025-04-14 21:49 (1 reply)

    How does this compare to Microprofile?

    https://github.com/jonasmr/microprofile

    Btw, I recently worked with a library that had their own profiler which generated a Chrome trace file, so you could load it up in the Chrome dev tools to explore the call graph and timings in a fancy UI.

    It seems like such a good idea and I wish more profiling frameworks tried to do that instead of building their own UI.

    • By GeorgeHaldane 2025-04-15 00:12

      Haven't worked with it, but based on an initial look it's quite a different thing, standing closer to a frame-based profiler like Tracy (https://github.com/wolfpld/tracy).

      As far as differences go:

      Microprofile:

        - frame-based
        - needs a build system
        - memory usage starts at 2 MB per thread
        - runs 2 threads of its own
        - provides system-specific info
        - good support for GPU workloads
        - provides live view
        - seems like a good choice for gamedev / rendering
      
      utl::profiler:

        - no specific pipeline
        - single include
        - memory usage starts at approx. nothing and would likely stay in kilobytes
        - doesn't run any additional threads
        - fully portable, nothing platform specific whatsoever, just standard C++
        - doesn't provide system-specific info, just pure timings
        - seems like a good choice for small projects or embedded (since the only thing it needs is a C++ compiler)
