Best performance of a C++ singleton

2026-03-06 13:40 4843 andreasfertig.com

In my January post, I focused on implementing a singleton correctly. This time I want to add performance into the mix and show you the best way to implement your singleton... or give you guidance to pick your best way.

Setting the scene

I'm using a display manager as an example, like GDM, LightDM, or others in the Linux world. Here is the motivating implementation for today:

// A
enum class Resolution
{
  r640x480,
  r800x600,
  // ...
};

// B
class DisplayManager {
  Resolution mResolution{};

  // C
  DisplayManager(const DisplayManager&) = default;
  DisplayManager(DisplayManager&&) = default;
  DisplayManager& operator=(const DisplayManager&) = default;
  DisplayManager& operator=(DisplayManager&&) = default;

  DisplayManager() = default;  // D

public:
  static DisplayManager& Instance() noexcept
  {
    static DisplayManager dspm{};  // E

    return dspm;
  }

  void SetResolution(Resolution newRes) { mResolution = newRes; }
  Resolution GetResolution() const { return mResolution; }
};

Let me quickly go through the various parts. In A, you see the data type Resolution which illustrates two resolutions; you can imagine the rest. Next in B, you find the DisplayManager implementation. Diving into the implementation, you can see that I used my own advice from my last post and made the copy- and move-operations private in C. This is all just setup for today's focus.

To complete the picture, here is how I use the object:

Resolution Use()
{
  auto& s = DisplayManager::Instance();
  s.SetResolution(Resolution::r640x480);

  return s.GetResolution();
}
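
The effect of C and D can also be checked at compile time. The following is a minimal sketch that repeats the class from above and adds a few static_assert lines. The checks and the Demo function are additions for illustration only and are not part of the original listing.

```cpp
#include <cassert>
#include <type_traits>

enum class Resolution
{
  r640x480,
  r800x600,
};

class DisplayManager {
  Resolution mResolution{};

  // C: private, so code outside the class cannot copy or move the singleton
  DisplayManager(const DisplayManager&) = default;
  DisplayManager(DisplayManager&&) = default;
  DisplayManager& operator=(const DisplayManager&) = default;
  DisplayManager& operator=(DisplayManager&&) = default;

  DisplayManager() = default;  // D: private as well

public:
  static DisplayManager& Instance() noexcept
  {
    static DisplayManager dspm{};  // E
    return dspm;
  }

  void SetResolution(Resolution newRes) { mResolution = newRes; }
  Resolution GetResolution() const { return mResolution; }
};

// Illustrative checks: seen from outside the class, none of these operations is available.
static_assert(!std::is_default_constructible<DisplayManager>::value, "D is private");
static_assert(!std::is_copy_constructible<DisplayManager>::value, "C is private");
static_assert(!std::is_move_constructible<DisplayManager>::value, "C is private");
static_assert(!std::is_copy_assignable<DisplayManager>::value, "C is private");
static_assert(!std::is_move_assignable<DisplayManager>::value, "C is private");

Resolution Use()
{
  auto& s = DisplayManager::Instance();
  s.SetResolution(Resolution::r640x480);

  return s.GetResolution();
}

// Illustrative usage: every call to Instance() yields the same object.
void Demo()
{
  assert(&DisplayManager::Instance() == &DisplayManager::Instance());
  assert(Use() == Resolution::r640x480);
}
```

The type traits are evaluated from a context unrelated to the class, which is why the private special members count as unavailable.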

Let's talk performance

Going back to the DisplayManager implementation, the interesting part starts with D, the default constructor, which of course must be private in a singleton. More on that in a moment. As a last item, you see E, where I use a block local static for the variable dspm.

With D and E we have two places where we can use different implementations that influence the performance of DisplayManager objects or, more precisely, of access to them. But you might not always have the full freedom to pick all the options.

In my DisplayManager implementation I present you with a simple case. The default constructor can be defaulted since DisplayManager only holds an object of type Resolution, a scoped enum (enum class) which boils down to an integer type. I don't need any code inside the constructor's body. There are cases where this doesn't apply and you need to write code for the constructor body. This lets us distinguish two cases:

  • defaultable default constructor (user-declared constructor)
  • a constructor with implementation (user-defined constructor)
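
To make the two cases concrete, here is a minimal sketch. The type names Declared and Defined, the constructor body, and the Demo function are assumptions for illustration, not code from this post. It also shows one property that matters for the generated code: an object of the defaulted kind can be created in a constant expression, while the one with a hand-written, non-constexpr constructor cannot.

```cpp
#include <cassert>

enum class Resolution { r640x480, r800x600 };

// Case 1: user-declared. The constructor is defaulted, there is no body to run.
struct Declared {
  Resolution mResolution{};
  Declared() = default;
};

// Case 2: user-defined. The body below is a made-up placeholder.
struct Defined {
  Resolution mResolution;
  Defined() : mResolution{Resolution::r800x600} { /* imagine more setup code here */ }
};

// Compiles only because a Declared object can be created in a constant expression.
constexpr Declared kProbe{};
static_assert(kProbe.mResolution == Resolution::r640x480, "initialized at compile time");

// constexpr Defined kBroken{};  // would not compile: Defined() is not constexpr

void Demo()
{
  Defined d{};  // runs the constructor body at run time
  assert(d.mResolution == Resolution::r800x600);
}
```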

If you look at the generated assembly for DisplayManager with a user-declared constructor, you'll see this:

Use():
 mov DWORD PTR DisplayManager::Instance()::dspm[rip], 0
 xor eax, eax
 ret
main:
 xor eax, eax
 ret
DisplayManager::Instance()::dspm:
 .zero 4

For now, let's say that's good.

Once you look at the generated code for an implementation with a user-defined constructor you'll get this:

Use():
 movzx eax, BYTE PTR guard variable for DisplayManager::Instance()::dspm[rip]
 test al, al
 je .L13
 mov DWORD PTR DisplayManager::Instance()::dspm[rip], 0
 xor eax, eax
 ret
.L13:
 sub rsp, 8
 mov edi, OFFSET FLAT:guard variable for DisplayManager::Instance()::dspm
 call __cxa_guard_acquire
 test eax, eax
 jne .L14
.L3:
 mov DWORD PTR DisplayManager::Instance()::dspm[rip], 0
 xor eax, eax
 add rsp, 8
 ret
.L14:
 mov DWORD PTR DisplayManager::Instance()::dspm[rip], 0
 mov edi, OFFSET FLAT:guard variable for DisplayManager::Instance()::dspm
 call __cxa_guard_release
 jmp .L3
main:
 xor eax, eax
 ret
guard variable for DisplayManager::Instance()::dspm:
 .zero 8
DisplayManager::Instance()::dspm:
 .zero 4

Now you can see why I called the user-declared version good. Once the compiler has to run a user-defined default constructor, it must insert a guard variable and check its state each time you call Instance, which adds up to a good amount of code. Please notice that at this point you're looking at code generated with GCC 15 at -O3 and I did not even call SetResolution or GetResolution.

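Another thing to consider is that __cxa_guard_acquire and __cxa_guard_release are calls into the runtime library, so they add a slight delay the first time Instance runs. Every later call still pays for the guard check at the top of Use.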

Here is a Compiler Explorer link that shows the two options.
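
To give a feeling for what the guard does, here is a hand-written approximation in portable C++. It is only a sketch of the general idea (check a flag on every call, take a lock on the first one) and not what GCC's runtime actually contains. Widget, GuardedInstance, gConstructions, and Demo are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <new>

std::atomic<int> gConstructions{0};  // counts constructor runs, for the demo only

// Hypothetical stand-in for a type whose constructor has a body.
struct Widget {
  int value;
  Widget() : value{42} { ++gConstructions; }
};

Widget& GuardedInstance()
{
  // Raw storage, comparable to the bytes reserved for dspm in the listing.
  alignas(Widget) static unsigned char storage[sizeof(Widget)];
  // Doubles as the guard: nullptr means "not constructed yet".
  static std::atomic<Widget*> instance{nullptr};
  // Serializes the first-time construction.
  static std::mutex initLock;

  // Fast path: this check runs on every single call.
  Widget* p = instance.load(std::memory_order_acquire);
  if (p == nullptr) {
    // Slow path: only reached until the object exists.
    std::lock_guard<std::mutex> hold{initLock};
    p = instance.load(std::memory_order_relaxed);
    if (p == nullptr) {
      p = new (storage) Widget{};  // never destroyed; fine for this trivial type
      instance.store(p, std::memory_order_release);
    }
  }
  return *p;
}

// Illustrative usage: two calls, one object, one constructor run.
void Demo()
{
  Widget& a = GuardedInstance();
  Widget& b = GuardedInstance();
  assert(&a == &b);
  assert(a.value == 42);
  assert(gConstructions.load() == 1);
}
```

The point of the sketch is the shape: even after initialization, each call still performs the load and the branch.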

All right, what else can we do? Right, you can use a different approach in E.

Using a static data member

Instead of implementing the singleton pattern using a block local static variable, you can go for a private static data member. Time to see how this implementation behaves. Here is my implementation where I kept the labels stable:

// B
class DisplayManager {
  static DisplayManager mDspm;  // E1

  Resolution mResolution{};

  // C
  DisplayManager(const DisplayManager&) = default;
  DisplayManager(DisplayManager&&) = default;
  DisplayManager& operator=(const DisplayManager&) = default;
  DisplayManager& operator=(DisplayManager&&) = default;

  DisplayManager() = default;  // D

public:
  static DisplayManager& Instance() noexcept
  {
    // E2
    return mDspm;
  }

  void SetResolution(Resolution newRes) { mResolution = newRes; }
  Resolution GetResolution() const { return mResolution; }
};

// Imagine this code is in an implementation file.
DisplayManager DisplayManager::mDspm{};  // E3

You can see that the changes are only in E1, E2, and E3. The latter one is required just for completeness. The interesting change is in E2 where I no longer use a block local static but the static data member from E1. You still have the two options: user-declared and user-defined constructor.

For a user-declared constructor, my code results in:

Use():
 mov DWORD PTR DisplayManager::mDspm[rip], 0
 xor eax, eax
 ret
main:
 xor eax, eax
 ret
DisplayManager::mDspm:
 .zero 4

Which is exactly the same code as for the previous implementation. Things get interesting when you start looking at the user-defined constructor case:

Use():
 mov DWORD PTR DisplayManager::mDspm[rip], 0
 xor eax, eax
 ret
main:
 xor eax, eax
 ret
_GLOBAL__sub_I_DisplayManager::mDspm:
 mov DWORD PTR DisplayManager::mDspm[rip], 0
 ret
DisplayManager::mDspm:
 .zero 4

That code looks much better than the one before. No locks are required this time, which not only leads to less assembly code but also faster code at the same time.

You'll find the two versions here on Compiler Explorer.
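
One behavioural difference is visible in the listing above: the _GLOBAL__sub_I_ function runs during program start-up, so the static data member is constructed before main, while a block local static is constructed on the first call to Instance. Here is a minimal sketch of that difference. Eager, Lazy, the two counters, and DemoInitOrder are hypothetical names, and everything is assumed to live in a single translation unit.

```cpp
#include <cassert>

int gEagerCtorCalls = 0;  // counts Eager constructor runs
int gLazyCtorCalls = 0;   // counts Lazy constructor runs

// Static data member variant: constructed during program start-up.
class Eager {
  static Eager sInstance;
  Eager() { ++gEagerCtorCalls; }

public:
  static Eager& Instance() noexcept { return sInstance; }
};

Eager Eager::sInstance{};

// Block local static variant: constructed on the first call to Instance().
class Lazy {
  Lazy() { ++gLazyCtorCalls; }

public:
  static Lazy& Instance() noexcept
  {
    static Lazy obj{};
    return obj;
  }
};

// Illustrative usage; call it once, before anything else touches Lazy.
void DemoInitOrder()
{
  assert(gEagerCtorCalls == 1);  // already constructed, nobody asked for it
  assert(gLazyCtorCalls == 0);   // not constructed yet

  Lazy::Instance();
  Lazy::Instance();
  assert(gLazyCtorCalls == 1);   // constructed exactly once, on first use
}
```

Whether the eager construction is acceptable depends on the program, for example on whether other globals use the singleton during their own initialization.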

Summary

If you want to have good performance for your singleton implementation and you need to provide a constructor, you should go for the static data member implementation. In case you can default the default constructor, the two implementation strategies are equivalent performance-wise. I would suggest using the block local approach as it saves having to define and initialize the singleton object in an implementation file.

Andreas


Comments

  • By halayli 2026-03-08 5:47 (5 replies)

    The performance observation is real but the two approaches are not equivalent, and the article doesn't mention what you're actually trading away, which is the part that matters.

    The C++11 threadsafety guarantee on static initialization is explicitly scoped to block local statics. That's not an implementation detail, that's the guarantee.

    The __cxa_guard_acquire/release machinery in the assembly is the standard fulfilling that contract. Move to a private static data member and you're outside that guarantee entirely. You've quietly handed that responsibility back to yourself.

    Then there's the static initialization order fiasco, which is the whole reason the meyers singleton with a local static became canonical. Block local static initializes on first use, lazily, deterministically, thread safely. A static data member initializes at startup in an order that is undefined across translation units. If anything touches Instance() during its own static initialization from a different TU, you're in UB territory. The article doesn't mention this.

    Real world singleton designs also need: deferred/configuration-driven initialization, optional instantiation, state recycling, controlled teardown. A block local static keeps those doors open. A static data member initializes unconditionally at startup, you've lost lazy-init, you've lost the option to not initialize it, and configuration based instantiation becomes awkward by design.

    Honestly, if you're bottlenecking on singleton access, that's design smell worth addressing, not the guard variable.

    • By menaerus 2026-03-08 7:30 (4 replies)

      > Honestly, if you're bottlenecking on singleton access, that's design smell worth addressing, not the guard variable.

      There's a large group of engineers who are totally unaware of Amdahl's law and they are consequently obsessed with the performance implications of what are usually most non-important parts of the codebase.

      I learned that being in the opposite group of people became (or maybe has been always) somewhat unpopular because it breaks many of the myths that we have been taught for years, and on top of which many people have built their careers. This article may or may not be an example of that. I am not reading too much into it but profiling and identifying the actual bottlenecks seems like a scarce skill nowadays.

      • By PacificSpecific 2026-03-08 9:22 (1 reply)

        You leveled up past a point a surprising number of people get stuck on essentially.

        I feel like the mindset you are describing is kind of this intermediate senior level. Sadly a lot of programmers can get stuck there for their whole career. Even worse when they get promoted to staff/principal level and start spreading dogma.

        I 100 percent agree. If you can't show me a real world performance difference you are just spinning your wheels and wasting time.

        • By menaerus 2026-03-09 10:02

          Yes, I agree, and my experience is the same - there's just too many folks getting stuck in that mindset and never leaving it. Looking into the history I think software engineering domain has a lot of cargo-cult, which is somewhat surprising given that people who are naturally attracted to this domain are supposed to be critical thinkers. It turns out that this may not be true for most of the time. I know that I was also afoul of that but I learned my lesson.

      • By amluto 2026-03-08 15:12

        On the flip side, it’s easy to get a bit stuck down the road by the mere fact that you have a singleton. Maybe you have amazing performance and very carefully managed safety, but you still have a single object that is inherently shared by all users in the same process, and it’s very very easy to end up regretting the semantic results. Been there, done that.

      • By pjmlp 2026-03-08 11:02

        Worse, while shipping Electron crap is the other extreme, not everything needs to be written to fit into 64 KB or 16ms rendering frame.

        Many times taking a few extra ms, or God forbid 1s, is more than acceptable when there are humans in the loop.

      • By halayli 2026-03-08 12:02 (1 reply)

        agreed. Strong emphasis on "profiling and identifying the actual bottleneck". Every benchmark will show a nested stack of performance offenders, but a solid interpretation requires a much deeper understanding of systems in general. My biggest aha moment yrs ago was when I realized that removing the function I was trying to optimize will still result in a benchmark output that shows top offenders and without going into too many details that minor perspective shift ended up paying dividends as it helped me rebuild my perspective on what benchmarks tell us.

        • By menaerus 2026-03-09 9:52 (1 reply)

          Yeah ... and so it happens that this particular function in the profile is just a symptom, merely being an observation (single) data point of system behavior under given workload, and not the root cause for, let's say, load instruction burning 90% of the CPU cycles by waiting on some data from the memory, and consequently giving you a wrong clue about the actual code creating that memory bus contention.

          I have to say that up until I grasped a pretty good understanding of CPU internals, memory subsystem, kernel, and generally the hardware, reading into the perf profiles was just a fun exercise giving me almost no meaningful results.

          • By halayli 2026-03-10 4:52

            Totally. I always found joy solving critical performance problems because it naturally paves a path forward to peel the layers and untangle the system interactions, which feeds my curiosity and is highly rewarding.

    • By cv5005 2026-03-08 13:21

      >Then there's the static initialization order fiasco

      One of the reasons I hate constructors and destructors.

      Explicit init()/deinit() functions are much better.

    • By Rexxar 2026-03-08 12:40

      The fact that he calls the generated code good/bad without discussing the semantic differences tells that the original author doesn't really know what he is talking about. That seems problematic to me as he is selling c++ online course.

    • By alex_dev42 2026-03-08 6:12 (1 reply)

      [dead]

      • By halayli 2026-03-08 6:38

        Yes definitely not dismissing the lock overhead, but I wanted to bring attention to the implicit false equivalence made in the post. That said, I am surprised the lock check was showing up and not the logging/formatting functions.

    • By csegaults 2026-03-08 6:09 (3 replies)

      [flagged]

      • By halayli 2026-03-08 6:28 (1 reply)

        a real human. threads can exist before main() starts. for example, you can include another tu which happens to launch a thread and call instance(). Singletons used to be a headache before C++11 and it was common (maybe still is) to see macros in projects that expand to a singleton class definition to avoid common pitfalls.

      • By platinumrad 2026-03-08 6:14

        It's a bit contrived, but a global with a nontrivial constructor can spawn a thread that uses another global, and without synchronization the thread can see an uninitialized or partially initialized value.

      • By jibal 2026-03-08 7:47

        [flagged]

  • By platinumrad 2026-03-08 5:44 (1 reply)

    I haven't written C++ in a long time, but isn't the issue here that the initialization order of globals in different translation units is unspecified? Lazy initialization avoids that problem at very modest cost.

    • By pjmlp 2026-03-08 11:08

      Yeah, that is part of it.

  • By m-schuetz 2026-03-08 4:38 (1 reply)

    I liked using singletons back in the day, but now I simply make a struct with static members which serves the same purpose with less verbose code. Initialization order doesn't matter if you add one explicit (and also static) init function, or a lazy initialization check.

    • By procaryote 2026-03-08 8:16 (1 reply)

      Yeah, I feel singletons are mostly a result of people learning globals are bad and wanting to pretend their global isn't a global.

      A bit like how java people insisted on making naive getFoo() and setFoo() to pretend that was different from making foo public

      • By jonathanlydall 2026-03-08 11:16 (1 reply)

        > A bit like how java people insisted on making naive getFoo() and setFoo() to pretend that was different from making foo public

        But it's absolutely different and sometimes it really matters.

        I primarily work with C# which has the "property" member type which is essentially a first-class language feature for having a get and set method for a field on a type. What's nice about C# properties is that you don't have to manually create the backing field and implement the logic to get/set it, but you still have the option to do it at a later time if you want.

        When you compile C# code (I expect Java is essentially the same) which accesses the member of another class, the generated IL/Bytecode is different depending on whether you're accessing a field, property or method.

        This means that if you later find it would be useful to intercept gets or updates to a field and add some additional logic for some reason (e.g. you want to now do lazy initialization), if you naively change the field to a method/property (even with the same name), existing code compiled against your original class will now fail at runtime with something like a "member not found" exception. Consumers of your library will be forced to recompile their code against your latest version for things to work again.

        By having getters and setters, you have the option of changing things without breaking existing consumers of your code. For certain libraries or platforms, this is the practical difference between being stuck with certain (now undesirable) behaviour forever or trivially being able to change it.

        • By procaryote 2026-03-08 20:34 (1 reply)

          Adding lots of code for the common case to support consumers of the code not recompiling for some uncommon potential future corner-cases seems like a bad deal.

          Recompiling isn't that hard usually.

          • By jonathanlydall 2026-03-09 11:19

            In a product world where customers are building on your platform, requiring that they schedule time with their own developers to recompile everything in order to move to the latest version of your product is an opportunity to lose one or more of those paying customers.

            These customers would also be quite rightfully annoyed when their devs report back to them that the extra work could have been entirely avoided if your own devs had done the industry norm of using setters/getters.

            Maybe you're not a product but there are various other teams at your organization which use your library, now in order to go live you need to coordinate with various different teams that they also update their code so that things don't break. These teams will report to their PMs how this could have all been avoided if only you had used getters and setters, like the entire industry recommends.

            Unless you're in a company with a single development team building a small system whose code would never be touched by anyone else, it's a good idea to do the setters/getters. And even then, what's true today might not be true years from now.

            It's generally good practice for a reason.

HackerNews