Show HN: Single-Header Profiler for C++17

2025-04-14 12:16 · github.com

Collection of self-contained header-only libraries for C++17 - DmitriBogdanov/UTL

utl::profiler is a single-include solution for localized profiling. It features simple macros to measure how much time is taken by a certain scope / expression / code segment. The profiler automatically builds a call graph for all profiled functions and prints a nicely formatted table for every thread. See examples.
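
A minimal quick-start sketch (hypothetical usage, assuming the single header is on the include path):

#include "UTL/profiler.hpp"

int main() {
    // Label is a string literal that will show up in the results table
    UTL_PROFILER("Main loop")
    for (int i = 0; i < 1000; ++i) {
        // ... work to be measured ...
    }
} // per-thread results table prints automatically on exit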

Below is an output example from profiling a JSON parser (image not reproduced here).

// Profiling macros
UTL_PROFILER_SCOPE(label);
UTL_PROFILER(label);
UTL_PROFILER_BEGIN(segment, label);
UTL_PROFILER_END(segment);

// Style options
struct Style {
    std::size_t indent = 2;
    bool        color  = true;

    double cutoff_red    = 0.40; // > 40% of total runtime
    double cutoff_yellow = 0.20; // > 20% of total runtime
    double cutoff_gray   = 0.01; // <  1% of total runtime
};

// Global profiler object
struct Profiler {
    void print_at_exit(bool value) noexcept;

    void upload_this_thread();

    std::string format_results(const Style& style = Style{});
};

inline Profiler profiler;
UTL_PROFILER_SCOPE(label);

Attaches profiler to the current scope.

If the profiled scope was entered at any point of the program, a per-thread call graph will be built for all profiled segments upon exiting main().

Note 1: label is a string literal name that will be shown in the results table.

Note 2: Automatic printing on exit can be disabled.
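
For example, a minimal sketch attaching the profiler to a function body (process_file is a hypothetical function):

void process_file() {
    UTL_PROFILER_SCOPE("Processing"); // measures from here to the end of the scope
    // ... code to be measured ...
}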

UTL_PROFILER(label);

Attaches profiler to the scope of the following expression.

Convenient for profiling individual loops, function calls, if-branches and the like.
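
A one-line sketch (parse_json is a hypothetical function):

UTL_PROFILER("Parsing") parse_json(); // only the following statement is measured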

UTL_PROFILER_BEGIN(segment, label);
UTL_PROFILER_END(segment);

Attaches profiler to the code section between two BEGIN/END macros with the same segment label.
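
A short sketch with hypothetical labels and helper functions, profiling a section that does not form its own scope:

UTL_PROFILER_BEGIN(io, "File I/O");
write_results();   // hypothetical call
read_back_cache(); // hypothetical call
UTL_PROFILER_END(io);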

struct Style {
    std::size_t indent = 2;
    bool        color  = true;

    double cutoff_red    = 0.40; // > 40% of total runtime
    double cutoff_yellow = 0.20; // > 20% of total runtime
    double cutoff_gray   = 0.01; // <  1% of total runtime
};

A struct with formatting settings for Profiler::format_results().
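
For instance, a sketch of tweaking the highlight thresholds (the values here are arbitrary):

profiler::Style style;
style.cutoff_red    = 0.50; // highlight segments taking > 50% of total runtime
style.cutoff_yellow = 0.25; // highlight segments taking > 25% of total runtime

std::cout << profiler::profiler.format_results(style);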

void Profiler::print_at_exit(bool value) noexcept;

Sets whether profiling results should be automatically printed after exiting from main(). true by default.

Note: This and all other profiler object methods are thread-safe.

void Profiler::upload_this_thread();

Uploads profiling results from the current thread to the profiler object.

Can be used to upload results from detached threads. Otherwise, results are uploaded automatically once the thread joins.

std::string Profiler::format_results(const Style& style = Style{});

Formats profiling results to a string using given style options.

inline Profiler profiler;

Global profiler object.

Note

Online compiler explorers may be a little unreliable when it comes to sleep & time measurement precision.

[ Run this code ]

using namespace std::chrono_literals;

void computation_1() { std::this_thread::sleep_for(300ms); }
void computation_2() { std::this_thread::sleep_for(200ms); }
void computation_3() { std::this_thread::sleep_for(400ms); }
void computation_4() { std::this_thread::sleep_for(600ms); }
void computation_5() { std::this_thread::sleep_for(100ms); }

// ...

// Profile a scope
UTL_PROFILER_SCOPE("Computation 1 - 5");
computation_1();
computation_2();

// Profile an expression
UTL_PROFILER("Computation 3") computation_3();

// Profile a code segment
UTL_PROFILER_BEGIN(comp_45, "Computation 4 - 5");
computation_4();
computation_5();
UTL_PROFILER_END(comp_45);

Output:

[ Run this code ]

void recursive(int depth = 0) {
    if (depth > 4) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        return;
    }

    UTL_PROFILER("1st recursion branch") recursive(depth + 1);
    UTL_PROFILER("2nd recursion branch") recursive(depth + 2);
}

// ...

recursive();

Output:

Note

In this example we will use utl::parallel to represent a parallel section concisely.

[ Run this code ]

using namespace utl;
using namespace std::chrono_literals;

// Run loop on the main thread
UTL_PROFILER("Single-threaded loop")
for (int i = 0; i < 30; ++i) std::this_thread::sleep_for(10ms);

// Run the same loop on 3 threads
parallel::set_thread_count(3);

UTL_PROFILER("Multi-threaded loop")
parallel::for_loop(parallel::IndexRange{0, 30}, [](int low, int high){
    UTL_PROFILER("Worker thread loop")
    for (int i = low; i < high; ++i) std::this_thread::sleep_for(10ms);
});

parallel::set_thread_count(0);

Output:

Note

In this example we will use utl::parallel to represent a detached section concisely.

[ Run this code ]

using namespace utl;
using namespace std::chrono_literals;

parallel::set_thread_count(2);

// Detached task
UTL_PROFILER("Uploading task 1")
parallel::task([]{
    UTL_PROFILER("Detached task 1: part 1") std::this_thread::sleep_for(700ms);
});

// Detached task with explicit result upload
UTL_PROFILER("Uploading task 2")
parallel::task([]{
    UTL_PROFILER("Detached task 2: part 1") std::this_thread::sleep_for(50ms);
    UTL_PROFILER("Detached task 2: part 2") std::this_thread::sleep_for(50ms);

    // Manually upload results to the main thread,
    // otherwise results get collected once the thread joins
    profiler::profiler.upload_this_thread();

    UTL_PROFILER("Detached task 2: part 3") std::this_thread::sleep_for(500ms);
});

// Wait a little so the 2nd task has time to reach manual upload
UTL_PROFILER("Waiting for task 2 to be partially done")
std::this_thread::sleep_for(200ms);

// Format results explicitly
profiler::profiler.print_at_exit(false);

std::cout << profiler::profiler.format_results();

Output:

[ Run this code ]

using namespace utl;
using namespace std::chrono_literals;

// Profile something
UTL_PROFILER("Loop")
for (int i = 0; i < 10; ++i) {
    UTL_PROFILER("1st half of the loop") std::this_thread::sleep_for(10ms);
    UTL_PROFILER("2nd half of the loop") std::this_thread::sleep_for(10ms);
}

// Disable automatic printing
profiler::profiler.print_at_exit(false);

// Disable colors, remove indent, format to string
profiler::Style style;
style.color  = false;
style.indent = 0;

const std::string results = profiler::profiler.format_results(style);

// Export to file & console
std::ofstream("profiling_results.txt") << results;
std::cout                              << results;

Output:

-------------------- UTL PROFILING RESULTS ---------------------

# Thread [main] (reuse 0) (running) (runtime -> 201.81 ms)
 - 99.99%  | 201.79 ms |                 Loop | example.cpp:8, main()  |
 - 49.91%  | 100.73 ms | 1st half of the loop | example.cpp:10, main() |
 - 50.07%  | 101.04 ms | 2nd half of the loop | example.cpp:11, main() |

By far the most significant part of profiling overhead comes from calls to std::chrono::steady_clock::now().

It is possible to significantly reduce that overhead by using CPU-counter intrinsics. To do so, simply define the UTL_PROFILER_USE_INTRINSICS_FOR_FREQUENCY macro to the needed frequency before the include:

#define UTL_PROFILER_USE_INTRINSICS_FOR_FREQUENCY 3.3e9 // 3.3 GHz (AMD Ryzen 5 5600H)
#include "UTL/profiler.hpp" // will now use 'rdtsc' for timestamps

This is exceedingly helpful when profiling code on a hot path. Below are a few benchmarks showcasing the difference on one particular machine:

======= USING std::chrono ========

| relative |               ms/op |                op/s |    err% |     total | benchmark
|---------:|--------------------:|--------------------:|--------:|----------:|:----------
|   100.0% |                3.46 |              289.22 |    0.1% |      0.44 | `Runtime without profiling`
|    53.9% |                6.41 |              155.90 |    0.3% |      0.77 | `Theoretical best std::chrono profiler`
|    52.2% |                6.62 |              151.07 |    0.2% |      0.80 | `UTL_PROFILER()`

// very light workload - just 8 computations of 'std::cos()' per 2 time measurements, difficult to
// time and sensitive to overhead; here the profiled code is ~2x slower than the non-profiled workload

====== USING __rdtsc() ======

| relative |               ms/op |                op/s |    err% |     total | benchmark
|---------:|--------------------:|--------------------:|--------:|----------:|:----------
|   100.0% |                3.50 |              286.11 |    0.6% |      0.43 | `Runtime without profiling`
|    86.3% |                4.05 |              247.01 |    0.2% |      0.49 | `Theoretical best __rdtsc() profiler`
|    73.7% |                4.74 |              210.97 |    0.3% |      0.57 | `UTL_PROFILER()`

// notable reduction in profiling overhead

Note

Here "theoretical best" refers to a hypothetical profiler that requires zero operations aside from measuring the time at two points — before and after entering the code segment.

To disable any profiling code from interfering with the program, simply define UTL_PROFILER_DISABLE before including the header:

#define UTL_PROFILER_DISABLE
#include "UTL/profiler.hpp"
// - the header is now stripped of any and all code and only provides no-op mocks of the public API,
//   which means effectively no impact on compile times
// - 'profiler.format_results()' now returns "<profiling is disabled>"

A simple & naive way to construct a call graph would be to build a tree of nodes using std::unordered_map<std::string, Node> with the call-site as a key. Such an approach, however, makes the overhead of tree expansion & traversal incredibly high, rendering the profiler useless for small tasks.

This library uses a bunch of thread_local variables (created by macros) to correlate call-sites with integer IDs, and reduces tree traversal logic to traversing a "network" of indices encoded as a dense $M \times N$ matrix, where $M$ is the number of call-sites visited by this thread and $N$ is the number of nodes in the call graph.

There are some additional details & arrays, but the bottom line is that by associating everything we can with linearly growing IDs, and by delaying "heavy" operations as much as possible until thread destruction / formatting, we can reduce almost all common operations outside of time measurement to trivial integer array lookups.

This way, the cost of re-entry on existing call graph nodes (aka the fast path taken most of the time) is reduced down to a single array lookup & branch that gets predicted most of the time.

New call-site entry & new node creation are rare slow paths: they only happen during call-graph expansion and contribute very little to the runtime outside of measuring very deep recursion. By using an std::vector-like allocation strategy for both rows & columns it is possible to make reallocation amortized $O(1)$.
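
As a rough illustration, here is a sketch of this traversal scheme under the assumptions above (all names and details are hypothetical, not the library's actual internals):

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of the dense call-graph "network" described above
struct CallGraph {
    static constexpr std::size_t empty = static_cast<std::size_t>(-1); // sentinel

    std::size_t              rows;             // M - call-sites seen by this thread
    std::size_t              cols = 1;         // N - nodes, starts with a root node
    std::size_t              current_node = 0; // root
    std::vector<std::size_t> prev;             // parent of each node, size N
    std::vector<std::size_t> next;             // M x N matrix, flattened row-major

    explicit CallGraph(std::size_t callsite_count)
        : rows(callsite_count), prev(1, empty), next(callsite_count, empty) {}

    // Fast path: re-entry into an existing node is one lookup & branch
    void traverse_forward(std::size_t callsite_id) {
        const std::size_t next_node = next[callsite_id * cols + current_node];
        if (next_node == empty) create_node(callsite_id); // rare slow path
        else                    current_node = next_node; // usually predicted
    }

    // Backwards traversal is a single array lookup
    void traverse_backward() { current_node = prev[current_node]; }

    // Slow path: add a node column (naive growth for illustration,
    // the real thing would amortize reallocation like std::vector)
    void create_node(std::size_t callsite_id) {
        const std::size_t new_node = cols;
        std::vector<std::size_t> grown(rows * (cols + 1), empty);
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                grown[r * (cols + 1) + c] = next[r * cols + c];
        next = std::move(grown);
        next[callsite_id * (cols + 1) + current_node] = new_node;
        prev.push_back(current_node);
        ++cols;
        current_node = new_node;
    }
};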

Memory overhead of profiling is mostly defined by the aforementioned call graph matrix. For example, for a thread that runs into 20 profiling macros and creates 100 nodes, the memory overhead is going to be about 8 kB. For a thread that runs into 100 profiling macros and creates 500 call graph nodes, the memory overhead will be about 0.2 MB.
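
Assuming the default 32-bit IDs, these estimates follow directly from the matrix size: $20 \times 100 \times 4 \text{ B} = 8$ kB and $100 \times 500 \times 4 \text{ B} = 0.2$ MB, plus comparatively small per-node arrays.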

It is possible to further reduce memory overhead (down to 4 kB and 0.1 MB in the examples above) by defining the UTL_PROFILER_USE_SMALL_IDS macro before the include:

#define UTL_PROFILER_USE_SMALL_IDS
#include "UTL/profiler.hpp"

This switches the implementation to 16-bit IDs, which limits the max number of nodes to 65535. For most practical purposes this should be more than enough, as most machines will hit stack overflow far before the call graph reaches such a size.

Almost all profiling is lock-free; there are only 3 points at which the implementation needs to lock a mutex:

  • When creating a new thread
  • When joining a thread
  • When manually calling profiler.upload_this_thread()

All public API is thread-safe.


Comments

  • By dustbunny 2025-04-15 14:35

    As a gamedev, I almost never need the total time spent in a function, rather I need to visualize the total time spent in a function for that frame. And then I scan the output for long frames and examine those hotspots one frame at a time. Would be nice to be able to use that workflow in this, but visualizing it would be much different.

  • By gurkwart 2025-04-15 11:51 (2 replies)

    Nice, I like the colored output tables. Started tinkering with a small profiling lib as well a while ago.

    https://github.com/gurki/glimmer

    It focuses on creating flamegraphs to view on e.g. https://www.speedscope.app/. I wanted to use std::stacktrace, but they are very costly to evaluate, even just lazily at exit. Eventually, I just tracked thread and call layer manually.

    If I understand correctly, you're tracking your call stack manually as well using some graph structure on linear ids? Mind elaborating a bit on its functionality and performance? Also proper platform-independent function names were a pita. Any comments on how you addressed that?

    • By dustbunny 2025-04-15 14:19

      Speed scope is awesome.

      I've been thinking about using speedscope as a reference to make a native viewer like that.

      Sampling profilers (like perf) are just so much easier to use than source markup ones. Just feel like the tooling around perf is bad and that speedscope is part of the solution.

    • By GeorgeHaldane 2025-04-15 22:48

      General rundown of the logic can be found in this comment on reddit: https://www.reddit.com/r/cpp/comments/1jy6ver/comment/mmze20...

      About linear IDs: A call graph in the general case is a tree of nodes; each node has a single parent and an arbitrary number of children. Each node accumulates time spent in the "lower" branches. A neat property of the callgraph relative to a generic tree is that every node can be associated with a callsite. For example, if some function f() calls itself recursively 3 times, there will be multiple nodes corresponding to it, but in terms of callsites there is still only one. So let's take a simple call graph as an example:

        Callgraph:         f() -> f() -> f() -> g()
                                             -> h()
        Node id:           0      1      2      3,4
      
      Let's say f() has callsite id '0', g() has callsite id '1', h() has callsite id '2'. The callgraph will then consist of N=5 nodes with M=3 different callsites:

        Node id:         { 0   1   2   3   4 }
        Callsite id:     { 0   0   0   1   2 }
      
      We can then encode all "prev" nodes as a single N-vector, and all "next" nodes as an MxN matrix, which has some kind of sentinel value (like -1) in places with no connection. For this example this results in the following:

        Node id:         { 0   1   2   3   4 }
        Prev. id:        { x   0   1   2   2 }
        Next id matrix:  [ 1   2   x   x   x ]
                         [ x   x   3   x   x ]
                         [ x   x   4   x   x ]
      
      Every thread has a thread-local callgraph object that keeps track of all this graph traversal; it holds 'current_node_id'. Traversing backwards on the graph is a single array lookup:

        current_node_id = prev_node_ids[current_node_id];
      
      Traversing forwards to an existing callgraph node is a lookup & branch:

        next_node_id = next_node_ids[callsite_id, current_node_id];
        if (next_node_id == x) create_node(callsite_id);       // rare slow path
        else                   current_node_id = next_node_id; // usually predicted branch
      
      New nodes can be created pretty cheaply too, but the details are too verbose for a comment. The key to tracking the callsites and assigning them IDs is the thread_local variables generated by the macro:

      https://github.com/DmitriBogdanov/UTL/blob/master/include/UT...

      When a callsite marker initializes (which only happens once), it gets a new ID. The timer then gets this 'callsite_id' and passes it to the forwards-traversal. The way we get function names is by simply remembering the __FILE__, __func__, __LINE__ pointers in another array of the call graph; they get saved during callsite marker initialization too. As far as performance goes, everything we do is cheap & simple operations; at this point the main overhead is just from taking the timestamps.

  • By bogwog 2025-04-14 21:49 (1 reply)

    How does this compare to Microprofile?

    https://github.com/jonasmr/microprofile

    Btw, I recently worked with a library that had their own profiler which generated a Chrome trace file, so you could load it up in the Chrome dev tools to explore the call graph and timings in a fancy UI.

    It seems like such a good idea and I wish more profiling frameworks tried to do that instead of building their own UI.

    • By GeorgeHaldane 2025-04-15 00:12

      Haven't worked with it, but based on an initial look it's quite a different thing, standing closer to a frame-based profiler like Tracy (https://github.com/wolfpld/tracy).

      As far as differences go:

      Microprofile:

        - frame-based
        - needs a build system
        - memory usage starts at 2 MB per thread
        - runs 2 threads of its own
        - provides system-specific info
        - good support for GPU workloads
        - provides live view
        - seems like a good choice for gamedev / rendering
      
      utl::profiler:

        - no specific pipeline
        - single include
        - memory usage starts at approx. nothing and would likely stay in kilobytes
        - doesn't run any additional threads
        - fully portable, nothing platform specific whatsoever, just standard C++
        - doesn't provide system-specific info, just pure timings
        - seems like a good choice for small projects or embedded (since the only thing it needs is a C++ compiler)
