About memory pressure, lock contention, and Data-oriented Design

2026-03-0811:596821mnt.io

I'm here to narrate you a story about performance. Recently, I was in the same room as some Memory Pressure and some Lock Contention. It took me a while to recognize them. Legend says it only happens…

I'm here to narrate you a story about performance. Recently, I was in the same room as some Memory Pressure and some Lock Contention. It took me a while to recognize them. Legend says it only happens in obscure, low-level systems, but I'm here to refute the legend. While exploring, I had the pleasure of fixing a funny bug in a higher-order stream: lucky us, to top it all off, we even have a sweet treat! This story is also a pretext to introduce you to Data-oriented Design, and to show how it improved execution time by 98.7% and throughput by 7718.5%. I believe we have all the ingredients for a juicy story. Let's cook, and bon appétit !

On a Beautiful Morning…#

While powering on my Dygma Defy, unlocking my computer, and checking messages from my colleagues, I suddenly come across this one:

Does anyone also experience a frozen room list?

Ah yeah, for some years now, I've been employed by Element to work on the Matrix Rust SDK. If one needs to write a complete, modern, cross-platform, fast Matrix client or bot, this SDK is an excellent choice. The SDK is composed of many crates. Some are very low in the stack and are not aimed at being used directly by developers, like matrix_sdk_crypto. Some others are higher in the stack — the highest is for User Interfaces (UI) with matrix_sdk_ui. While it is a bit opinionated, it is designed to provide the high-quality features everybody expects in a modern Matrix client.

One of these features is the Room List. The Room List is a place where users spend a lot of their time in a messaging application (along with the Timeline, i.e. the room's messages). Some expectations for this component:

  • Be superfast,
  • List all the rooms,
  • Interact with rooms (open them, mark them as unread etc.),
  • Filter the rooms,
  • Sort the rooms.

Let's focus on the part that interests us today: Sort the rooms. The Room List holds… no rooms. It actually provides a stream of updates about rooms; more precisely a Stream<Item = Vec<VectorDiff<Room>>>. What does this mean? The stream yields a vector of “diffs” of rooms. I'm writing a series about reactive programming — you might be interested to read more about it. Otherwise, here is what you need to know.

The VectorDiff type comes from the eyeball-im crate, initially created for the Matrix Rust SDK as a solid foundation for reactive programming. It looks like this:

pub enum VectorDiff<T> {

Append {

values: Vector<T>,

},

Clear,

PushFront {

value: T,

},

PushBack {

value: T,

},

PopFront,

PopBack,

Insert {

index: usize,

value: T,

},

Set {

index: usize,

value: T,

},

Remove {

index: usize,

},

Truncate {

length: usize,

},

Reset {

values: Vector<T>,

},

}

It represents a change in an ObservableVector. This is like a Vec, but one can subscribe to the changes, and will receive… well… VectorDiffs!

The Room List type merges several streams into a single stream representing the list of rooms. For example, let's imagine the room at index 3 receives a new message. Its “preview” (the latest event displayed beneath the room's name, e.g. Alice: Hello!) changes. Also, the Room List sorts rooms by their “recency” (the time something happened in the room). And since the “preview” has changed, its “recency” changes too, which means the room is sorted and re-positioned. Then, we expect the Room List's stream to yield:

  1. VectorDiff::Set { index: 3, value: new_room } because of the new “preview”,
  2. VectorDiff::Remove { index: 3 } to remove the room… immediately followed by
  3. VectorDiff::PushFront { value: new_room } to insert the room at the top of the Room List.

This reactive programming mechanism has proven to be extremely efficient.

Absolutely! These are two popular UI components where a VectorDiff maps straightforwardly to their List component update operations. They are actually (remarkably) pretty similar to each other.

You're always a good digression companion, thank you. Let's go back on our problem:

What does "frozen" mean for the Room List?

It means that the Room List is simply… blank, empty, vide, vacía, vuoto, خلو… well, you get the idea.

What could freeze the Room List?

What are our options?

Indeed, we have changed one sorter recently. Let's take a look at how this Room List stream is computed, shall we?

let stream = stream! {

loop {

let filter = filter_cell.take().await;

let (initial_values, stream) = self.entries();

let stream = merge_streams(initial_values.clone(), stream, other_updates);

let (initial_values, stream) = (initial_values, stream)

.filter(filter)

.sort_by(new_sorter_lexicographic(vec![

Box::new(new_sorter_latest_event()),

Box::new(new_sorter_recency()),

Box::new(new_sorter_name()),

]))

.dynamic_head_with_initial_value(page_size, limit_stream);

yield once(ready(vec![VectorDiff::Reset { values: initial_values }]))

.chain(stream);

}

}

.switch();

There is a lot going on here. Sadly, we are not going to explain everything in this beautiful piece of art.

The .filter(), .sort_by() and .dynamic_head_with_initial_value() methods are part of the eyeball-im-util crate. They are used to filter, sort etc. a stream: They are essentially mapping a Stream<Item = Vec<VectorDiff<T>>> to another Stream<Item = Vec<VectorDiff<T>>>. In other terms, they “change” the VectorDiffs on-the-fly to simulate filtering, sorting, or something else. Let's see a very concrete example with the Sort higher-order stream (the following example is mostly a copy of the documentation of Sort, but since I wrote this algorithm, I guess you, dear reader, will find it acceptable).

Let's imagine we have a vector of char. We want a Stream of changes about this vector (the famous VectorDiff). We also want to simulate a sorted vector, by only modifying the changes. The solution looks like this:

use std::cmp::Ordering;

use eyeball_im::{ObservableVector, VectorDiff};

use eyeball_im_util::vector::VectorObserverExt;

use stream_assert::{assert_next_eq, assert_pending};

fn cmp<T>(left: &T, right: &T) -> Ordering

where

T: Ord,

{

left.cmp(right)

}

let mut vector = ObservableVector::<char>::new();

let (initial_values, mut stream) = vector.subscribe().sort_by(cmp);

assert!(initial_values.is_empty());

assert_pending!(stream);

Alrighty. That's a good start. vector is empty, so the initial values from the subscribe are empty, and the stream is also pending. I think it's time to play with this new toy, isn't it?


vector.append(vector!['d', 'b', 'e']);

assert_next_eq!(

stream,

VectorDiff::Append { values: vector!['b', 'd', 'e'] }

);

assert_pending!(stream);

So far, so good. It looks naive and simple: one operation in, one operation out. It's funnier when things get more complicated though:


vector.append(vector!['f', 'g', 'a', 'c']);

assert_next_eq!(

stream,

VectorDiff::PushFront { value: 'a' }

);

assert_next_eq!(

stream,

VectorDiff::Insert { index: 2, value: 'c' }

);

assert_next_eq!(

stream,

VectorDiff::Append { values: vector!['f', 'g'] }

);

assert_pending!(stream);

Notice how vector is never sorted. That's the power of these higher-order streams of VectorDiffs: light and —more importantly— combinable! I repeat myself: we are always mapping a Stream<Item = Vec<VectorDiff<T>>> to another Stream<Item = Vec<VectorDiff<T>>>. That's the same type! The whole collection is never computed entirely, except for the initial values: only the changes are handled and trigger a computation. Knowing that, in the manner of Future, Stream is lazy —i.e. it does something only when polled—, it makes things pretty efficient. And…

Which topic? Ah! The frozen Room List! Sorters are not the culprit. There. Happy? Short enough?

These details were important. Kind of. I hope you've learned something along the way. Next, let's see how a sorter works, and how it could be responsible for our memory pressure and lock contention.

Randomness#

Taking a step back, I was asking myself: Is it really frozen?. The cherry on the cake: I was unable to reproduce the problem! Even the reporters of the problem were unable to reproduce it consistently. Hmm, a random problem? Fortunately, two of the reporters are obstinate. Ultimately, we got analysis.

Memory analysis of Element X in Android Studio (Element X is based on the Matrix Rust SDK). It presents a callback tree, with the number of allocations and deallocations for each node in this tree. Thanks Jorge!

And, holy cow, we see a lot of memory allocations, exactly 322'042 to be precise, counting for 743Mib, for the eyeball_im_util::vector::sort::SortBy type! I don't remember exactly how many rooms are part of the Room List, but it's probably around 500-600.

Download fullsize image as: AVIF, WebP, PNG.

The Room List wasn't frozen. It was taking a lot of time to yield values. Sometimes, up to 5 minutes on a phone. Alright, we have two problems to solve here:

  1. Why is it random?
  2. Why so many memory allocations and deallocations?

The second problem will be discussed in the next section. Let's start with the first problem in this section, shall we?

Let's start at the beginning. eyeball_im_util::vector::sort::SortBy is used like so:

stream

.sort_by(new_sorter_lexicographic(vec![

Box::new(new_sorter_latest_event()),

Box::new(new_sorter_recency()),

Box::new(new_sorter_name()),

]))

sort_by receives a sorter: new_sorter_lexicographic. It's from matrix_sdk_ui::room_list::sorters, and it's a constructor for a… lexicographic sorter. All sorters must implement the Sorter trait. Once again, it's a trait from matrix_sdk_ui, nothing fancy, it's simply this:


pub trait Sorter: Fn(&Room, &Room) -> Ordering {}

impl<F> Sorter for F

where

F: Fn(&Room, &Room) -> Ordering {}

Put it differently, all functions with two parameters of type &Room, and with a return type Ordering is considered a sorter. There. It's crystal clear now, except… what's a lexicographic sorter?

In short, we are executing 3 sorters: by latest event, by recency and by name.

None of these sorters are using any form of randomness. It's a cul-de-sac. Let's take a step back by looking at SortBy in eyeball_im_util itself maybe? Scroll the documentation, not here, read the initial patch, hmm, I see a mention of a binary search, jump into the code, ah, here, look at the comment:

When looking for the position of a value (e.g. where to insert a new value?), Vector::binary_search_by is used — it is possible because the Vector is sorted. When looking for the unsorted index of a value, Iterator::position is used.

Vector::binary_search_by doesn't mention any form of randomness in its documentation. Another cul-de-sac.

Ah! Brilliant. That's correct! Looking at the constructor of SortBy (or its implementation), we notice it's using Vector::sort_by. And guess what? It's relying on… drum rollquicksort! Following the path, we see it actually creates a pseudo random number generator (PRNG) to do the quicksort.

Phew. Finally. Time for a cup of tea and a biscuit.

My guess here is the following. Depending on the (pseudo randomly) generated pivot index, the number of comparisons may vary each time this runs. We can enter a pathological case where more comparisons means more memory pressure, which means slower sorting, which means… A Frozen Room ListTM, play horror movie music!

Memory Pressure#

A memory allocator is responsible for… well… allocating the memory. If you believe this is a simple problem, please retract this offensive thought quickly: what an oaf! Memory is managed based on the strategy or strategies used by the memory allocator: there is not a unique solution. Each memory allocator comes with trade-offs: do you allocate and replace multiple similar small objects several times in a row, do you need fixed-size blocks of memory, dynamic blocks etc.

Allocating memory is not free. The memory allocator has a cost in itself —which could be mitigated by implementing a custom memory allocator maybe—, but there is also a hardware cost, and it's comparatively more difficult to mitigate. Memory is allocated on the heap, i.e. the RAM, also called the main memory (not be confused with CPU caches: L1, L2…). The RAM is nice and all, but it lives far from the CPU. It takes time to allocate something on the heap and…

Hmm, refrain from opening the Pandora's box, let's try to stay high-level here, shall we? Be careful: the numbers I am going to present can vary depending on your hardware, but the important part is the scale: keep that in mind.

OperationTime“Human scale”
Fetch from L1 cache1ns1mn
Branch misprediction3ns3mn
Fetch from L2 cache4ns4mn
Mutex lock/unlock17ns17mn
Fetch from the main memory100ns1h40mn
SSD random read16'000ns11.11 days

Latency numbers for the year 2020 for various operations (source: Latency Numbers Every Programmer Shoud Know from Colin Scott (UC Berkeley)).

The time in the second column is given in nanoseconds, i.e. 11'000'000'000 second. The time in the third column is “humanized” to give us a better sense of the scale here: we imagine 1ns maps to 1min.

Do you see the difference between the L1/L2 caches and the main memory? 1ns to 100ns is the same difference as 1mn to 1h40. So, yes, it takes time to read from memory. That's why we try to avoid allocations as much as possible.

CPU CPU CPU L1 L2 RAM

Not comfortable with numbers? Let's try to visualise it with 1ns = 1s! On the left: the CPU. On the right, the L1 cache, the L2 cache, and the RAM. The “balls” represent the time it takes to move information between the CPU and the L1/L2 caches or the RAM.

Sadly, in our case, it appears we are allocating 322'042 times to sort the initial rooms of the Room List, for a total of 743'151'616 bits allocated, with 287 bytes per allocation. Of course, if we are doing quick napkin maths, it should take around 200ms. We are far from The Frozen Room ListTM, but there is more going on.

Do you remember the memory allocator? Its role is to also avoid fragmentation as much as possible. The number of memory “blocks” isn't infinite: when memory blocks are freed, and new ones are allocated later, maybe the previous blocks are no longer available and cannot be reused. The allocator has to find a good place, while keeping fragmentation under control. Maybe the blocks must be moved to create enough space to insert the new blocks (it's often preferable to have contiguous blocks).

That's what I call memory pressure. We are asking too much, too fast, and the memory allocator we use in the Matrix Rust SDK is not designed to handle this use case.

What are our solutions then?

Excellent ideas. Let's track which sorter creates the problem. We start with the sorter that was recently modified: latest_event. In short, this sorter compares the LatestEventValue of two rooms: the idea is that rooms with a LatestEventValue representing a local event, i.e. an event that is not sent yet, or is sending, must be at the top of the Room List. Alright, let's look at its core part:

pub fn new_sorter() -> impl Sorter {

let latest_events =

|left: &Room, right: &Room| (left.latest_event(), right.latest_event());

move |left, right| -> Ordering { cmp(latest_events, left, right) }

}

Alright. For each sorting iteration, the Room::latest_event method is called twice. This method is as follows:

pub fn latest_event(&self) -> LatestEventValue {

self.info.read().latest_event.clone()

}

Oh, there it is. We are acquiring a read lock over the info value, then we are reading the latest_event field, and we are cloning the value. Cloning is important here as we don't want to hold the read lock for too long. This is our culprit. The size of the LatestEventValue type is 144 bytes (it doesn't count the size of the event itself, because this size is dynamic).

Before going further, let's check whether another sorter has a similar problem, shall we? Look at the other sorters, oh!, turns out the recency sorter also uses the latest_event method! Damn, this is becoming really annoying.

Question: do we need the entire LatestEventValue? Probably not!

  • For the latest_event sorter, we actually only need to know when this LatestEventValue is local, that's it.
  • For the recency sorter, we only need to know the timestamp of the LatestEventValue.

So instead of copying the whole value in memory twice per sorter iteration, for two sorters, let's try to write more specific methods:

pub fn latest_event_is_local(&self) -> bool {

self.info.read().latest_event.is_local()

}

pub fn latest_event_timestamp(&self) -> Option<MilliSecondsSinceUnixEpoch> {

self.info.read().latest_event.timestamp()

}

Just like that, the throughput has been improved by 18% according to the room_list benchmark. You can see the patch in “action”. Can we declare victory over memory pressure?

Definitely yes! Everything above 200ms (from our napkin maths) is unacceptable here. Memory pressure was an important problem, and it's now solved, but it wasn't the only problem.

Lock Contention#

The assiduous reader may have noticed that we are still dealing with a lock here.

self.info.read().latest_event.

Do you remember we had 322'042 allocations? It represents the number of times the latest_event method was called basically, which means…

… yes… and please, stop interrupting me, I was trying to build up the suspense for a climax.

Anyway. Avoiding a lock isn't an easy task. However, this lock around info is particularly annoying because it's called by almost all sorters! They need information about a Room; all the information is in this info field, which is a read-write lock. Hmmm.

Let's change our strategy. We need to take a step back:

  1. The sorters need this data.
  2. Running the sorters won't change this data.
  3. When the data does change the sorters will be re-run.

Maybe we could fetch, ahead of time, all the necessary data for all sorters in a single type: it will be refreshed when the data changes, which is right before the sorters run again.

That's correct. If the type is small enough, it can fit more easily in the CPU caches, like L1 or L2. Do you remember how fast they are? 1ns and 4ns, much faster than the 100ns for the main memory. Moreover, it removes the lock contention and the memory pressure entirely!

I highly recommend watching the following talks if you want to learn more about Data-oriented Design (DoD)

Video: Data-Oriented Design and C++, by Mike Acton, at the CppCon 2014

The transformation of data is the only purpose of any program. Common approaches in C++ which are antithetical to this goal will be presented in the context of a performance-critical domain (console game development). Additionally, limitations inherent in any C++ compiler and how that affects the practical use of the language when transforming that data will be demonstrated. View the slides.

Video: Cpu Caches and Why You Care, by Scott Meyers, at the code::dive conference 2014

This talk explores CPU caches and their impact on program performance.

So. Let's be serious: I suggest trying to do some Data-oriented Design here. We start by putting all our data in a single type:

pub struct RoomListItem {

cached_latest_event_timestamp: Option<MilliSecondsSinceUnixEpoch>,

cached_latest_event_is_local: bool,

cached_recency_stamp: Option<RoomRecencyStamp>,

cached_display_name: Option<String>,

cached_is_space: bool,

cached_state: RoomState,

}

impl RoomListItem {

fn refresh_cached_data(&mut self, room: &Room) {

self.cached_latest_event_timestamp = room.new_latest_event_timestamp();

}

}

At this point, the size of RoomListItem is 64 bytes, acceptably small!

A bit of plumbing later, this new RoomListItem type is used everywhere by the Room List, by all its filters and all its sorters. For example, the latest_event sorter now looks like:

pub fn new_sorter() -> impl Sorter {

let latest_events = |left: &RoomListItem, right: &RoomListItem| {

(left.cached_latest_event_is_local, right.cached_latest_event_is_local)

};

move |left, right| -> Ordering { cmp(latest_events, left, right) }

}

The lock acquisitions happen only in refresh_cached_data, when a new update happens, not during the filtering or sorting anymore. Let's see what the benchmark has to say now.

Before:

$ cargo bench --bench room_list

RoomList/Create/1000 rooms × 1000 events

time: [53.027 ms 53.149 ms 53.273 ms]

thrpt: [18.771 Kelem/s 18.815 Kelem/s 18.858 Kelem/s]

After:

$ cargo bench --bench room_list

RoomList/Create/1000 rooms × 1000 events

time: [676.29 µs 676.84 µs 677.50 µs]

thrpt: [1.4760 Melem/s 1.4775 Melem/s 1.4787 Melem/s]

change:

time: [-98.725% -98.721% -98.716%] (p = 0.00 < 0.05)

thrpt: [+7686.9% +7718.5% +7745.6%]

Performance has improved.

Boom!

We don't see the 5 minutes lag mentioned by the reporters, but remember it's random. Nonetheless, the performance impact is huge:

  • From 18.8Kelem/s to 1.4Melem/s,
  • From 53ms to 676µs, or —to compare with the same unit— 0.676ms, so 78× faster!
  • The throughput has improved by 7718.5%, and the time by 98.7%.

Can we claim victory now?

The benchmark Iteration Times and Regression graphs are interesting to look at.

Iteration times

The initial Iteration Times, before our patches. Notice how the points do not follow any “trend”. It's a clear sign the program is acting erratically.

Iteration times

The final Iteration Times/Regression, after our patches. Notice how the points are linear.

The second graph is the kind of graph I like. Predictable.

Of course you're valuable! Now, the surprise.

The Dessert#

Of course, let's not forget about our dessert! I won't dig too much: the patch contains all the necessary gory details. In short, it's about how VectorDiff::Set can create a nasty bug in SortBy. Basically, when a value in the vector is updated, a VectorDiff::Set is emitted. SortBy is then responsible for computing a new VectorDiff:

  • it was calculating the old position of the value,
  • it was calculating the new position,
  • depending on that, it was emitting the appropriate VectorDiffs.

However, the old “value” wasn't removed from the buffer immediately and not every time. In theory, it should not cause any problem —it was an optimisation after all— except if… the items manipulated by the stream are “shallow clones”. Shallow cloning a value won't copy the value entirely: we get a new value, but its state is synced with the original value. This happens with types such as:

#[derive(Clone)]

struct S {

inner: Arc<T>

}

Here, cloning a value of type S and changing its inner field will also update the original value.

Just like that, it was possible to systematically create… an infinite loop. Funky isn't it?

You can view the patch Fix an infinite loop when SortBy<Stream<Item = T>> handles a VectorDiff::Set where T is a shallow clone type to learn more.

I think this is a concrete example of when jumping on an optimisation can lead to a bug. I'm not saying we should not prematurely optimise our programs: I'm a partisan of the “we should” camp. I'm saying that bugs can be pretty subtle sometimes, and this bug would have been avoided if we hadn't taken a shortcut in this algorithm. It's important to be correct first, then measure, then improve.

I hope you've learned a couple of things, and you've enjoyed your reading.

I would like to thank Andy Balaam and Damir Jelić for the reviews and the feedback!


Read the original article

Comments

  • By hinkley 2026-03-1211:381 reply

    > Legend says it only happens in obscure, low-level systems, but I'm here to refute the legend.

    In a somewhat self-deprecating, but also an “anyone can do this if they have the will” fashion, I tend to say my real skill in performance tuning is stubbornness. Orphaned gains can be easily responsible for 1/2 of the runtime of your average operation, sometimes as high as 2/3. The first 20% is fairly straightforward, the rest is just hard work. There’s always four smartasses willing to misquote a Knuth aphorism they never understood, but the fact is if your customers hate how slow the product is, and more to the point, if your marketing people are talking about it, it’s time to get off your asses and do the hard work, because your shit is broken.

    Yes there’s specialized knowledge but is it really any harder than understanding how databases work? Probably not. But if what is described I’m the quote above is a “legend”, then either things have got a lot worse while I wasn’t looking, or I’m underplaying the value of those skills in the overall mix, or both.

    The biggest memory pressure problem I ever fixed, I deduped running the same DB query twice and intersecting two filters of that data. I was expecting about a 2x improvement, but instead I got 10x. In retrospect I should have hoped for maybe as high as 4x due to the extra short circuiting on the data set, but not 10x. That was all better data locality.

    The irony of that situation is that I had been pulled in as a fixer and asked if I could figure out how to get a request taking thirty seconds down to three. This was supposed to be my opening salvo in a several week drill down and a series of improvements going from 15 to 10 to 8 to 7 seconds etc and prepping to apologize when I was only able to get it down to five, maybe six seconds. And instead I was done in like two days. But it did serve my larger thesis that we were using caching wrong. Data passed on the call stack doesn’t have to be looked up a second time. Or a third, or a fifth.

    • By mikrl 2026-03-1212:102 reply

      It feels like my era of education 2012-2020 (couple of degrees over that time) really deemphasized perf tuning, even heard it was practically useless in current day a few times.

      I had a computer organization course that came close but mostly just described microarchitecture and its historical development, not so much the practical ways to exploit it.

      Actually taking the time to sit down and poke around with techniques was mind blowing. I grew up during the golden age of CPU and OS advancements 90s-00s and the rush from seeing ‘instructions per cycle’ > 1 captured a bit of that magic that CRUD app dev and wrangling k8s just doesn’t have.

      • By tialaramex 2026-03-1212:332 reply

        It's not worth optimising if you don't have a problem. Focus your effort.

        If this code runs once per week at midnight, needs to finish by 5am, and currently it takes 18 minutes, the fact it could take 40 seconds isn't actually important and so spending meaningful engineering effort to go from 18 minutes to 40 seconds is a waste.

        On the other hand, if the code runs on every toaster when it's started and ideally would finish before the toast pops up, but currently takes 4 minutes, then even getting it down to 2.5 minutes will make more customers happy [also, why the fuck are we running software in the toaster? But that's beside the point] and might well be worth doing.

        The classic UX examples given are much closer to the latter category. When I type fast the symbols ought to appear immediately for example, if you can't do that then you have a performance problem and optimisation is appropriate. But so much of what software engineers do all day isn't in that space and doesn't need to prioritise performance so optimisation shouldn't be a priority.

        In particular Fast but Wrong is just Wrong. https://x.com/magdraws/status/1551612747569299458

        • By mikrl 2026-03-1213:46

          >so much of what software engineers do all day isn't in that space

          Seems to me that critical infra that supports a lot of modern computing is in that space though.

          If you want to develop that depth of knowledge you need to go into HPC/scientific, trading or accelerator hardware. I didn’t get into this sometimes crazy industry to NOT learn stuff and push the limits of my computer.

          I’m glad I know about those applications now, but I wonder how much of a disservice we did to the industry by just focusing on frameworks and abstraction especially now that you can just sling a lot of that out with a prompt…

        • By hinkley 2026-03-1219:40

          In the era of kubernetes and edge servers and everything running on battery power, that distinction between need and want becomes much fuzzier because of course we can bin pack the more efficient one better or preserve another five minutes of standby time even if the wall clock behavior is moot.

          And I’d also argue that if you wait to use a skill only until the need is dire then you will be both 1) shit at doing it and fail to achieve your goal well and 2) won’t have spent enough time on the cost/benefit analysis to know when things have changed over from want to need. Like the blind people I allude to in my top level.

      • By hinkley 2026-03-1220:16

        I went to a top ten school. I had one semester of circuit design, one one of EE, and a couple of computer architecture that went over the MIPs and writing assembly.

        I think there was some sort of transition of curriculum going on with the introductory classes though because the difficulty from one homework assignment to the next that first year of CS 1XX classes was pretty choppy. A friend and I made a game of one-upsmanship of adding our own constraints to the easier assignments to make them more interesting. Like taking larger inputs than the requirements and counting execution time.

        When I left school my first job the application was glacially slow, and I learned half of what I know about optimization in a short stint there through trial and error. It was a couple jobs in before I ever got pushback and had to learn the human factors element. But it (the optimization balanced against readability, robustness, extensibility) was a way I have always made pedestrian work more interesting. There are whole classes of code smells that also contain performance penalties, and at the peak of my restlessness I needed those to keep my sanity without irritating coworkers. I’m just cleaning up this messy code, nothing to see here.

        Reading release notes for other tools bragging on their improvements. Dev tools and frameworks are more forthcoming about how and what than consumer apps, but there are standouts from time to time. I read a ton of SIGPLAN proceedings during that era. Fortune favors the prepared mind and you look a lot smarter when you’re confronting a problem or opportunity with a primed pump rather than coming in cold (being friendly with other disciplines in your company also helps there).

  • By asQuirreL 2026-03-1210:181 reply

    Post can be summarised quite succinctly:

    Everything was slow because sorting was taking a lot of time. Sorting was slow because its comparator was taking ~6 read locks on every comparison, and was cloning large structures to avoid holding the lock for a long time. The first fix was to access just the information needed to avoid the clones, the second fix was to cache exactly the data needed for sorting after the underlying data was updated, and use that for the comparators without needing to take the underlying lock.

    I'm looking forward to the next post about how cache consistency is tough.

    • By hinkley 2026-03-1211:462 reply

      By far my favorite feature of lodash is the sortby function, in which instead of providing a custom comparator as most std libraries’ sort() function offers, provides a way to substitute a simpler object for the comparator to chew on. If your comparator needs to go beyond a couple nested conditionals, to actual data transform or grabbing locks, then that nasty little logn term in the runtime can take your sort from using 20% of the time budget for an expensive operation to taking 120%. Especially when you consider the memory bandwidth sloshing around L3 or the main memory bus to do the full table scan logn times.

      I think the world would be better off if this wasn’t in a third part library in a single programming language. Iirc Ruby is the only language I know with it as a built-in.

      • By tialaramex 2026-03-1212:041 reply

        Rust's slice of T [T] provides [T]::sort_by_cached_key which is a stable IPN sort which lets you provide a key transform callable f, which it will call at most once for each item to be sorted, sorting by comparing the (cached) results from that callable.

        https://doc.rust-lang.org/std/primitive.slice.html#method.so...

        However ..._by_cached_key is not provided for Rust's unstable sort because the unstable sort doesn't require an allocator and necessarily a cache does need an allocator.

        • By hinkley 2026-03-1223:43

          Yeah this is a function called sort_by but I don’t think it’s doing the same thing.

              let (initial_values, stream) = (initial_values, stream)
                      .filter(filter)
                      .sort_by(new_sorter_lexicographic(vec![
                          // Sort by latest event's kind.
                          Box::new(new_sorter_latest_event()),
                          // Sort rooms by their recency.
                          Box::new(new_sorter_recency()),
                          // Finally, sort by name.
                          Box::new(new_sorter_name()),
                      ]))
                      .dynamic_head_with_initial_value(page_size, limit_stream);
          
          
          That’s an api that would make an OG Java developer get tingles.

          sortBy should be locking each object once and I’m reasonably sure this is happening at least three times. Author ends up approximating _.sortBy() at the bottom by introducing a struct.

      • By wbl 2026-03-1213:33

        This is well known in Perl as the Schwartz transform

  • By scottlamb 2026-03-1213:511 reply

    IMO, there are a lot of smells in this code not addressed in the article. I only skimmed, and still, here are a few:

    1. They represent a single room change with this sequence of three operations:

        VectorDiff::Set { index: 3, value: new_room } because of the new “preview”,
        VectorDiff::Remove { index: 3 } to remove the room… immediately followed by
        VectorDiff::PushFront { value: new_room } to insert the room at the top of the Room List.
    
    and I don't see any mention of atomic sequences. I think the room will momentarily disappear from view before being placed into the correct spot. That kind of thing would drive me nuts as a user. It suggests to me this is not the right abstraction.

    Also, if you are actually representing the result with a vector, it's O(n), so from a performance perspective, it's not great if the vector can be large: you're shifting everything from [3, n) one spot forward and then one spot back, unnecessarily. If there were a `VectorDiff::Move`, you'd only be shifting 3 elements (the distance moved). Could still be the full length of the list but probably usually not? Something like a `BTreeSet` would make it actually O(lg n).

    2. Taking a lock in a comparison function (they call it `Sorter`, but the name is wrong) is a smell for correctness as well as performance. Can the values change mid-sort? Then the result is non-deterministic. (In C++ it's actually undefined behavior to use a non-deterministic comparator. In Rust it's safe but still a bad idea.) You just can't sort values while they're changing, full stop, so inner mutability in a list you're sorting is suss. [edit: and for what? within a client, are you seriously doing heavy mutations on many rooms at once? or is a single lock on all the rooms sufficient?]

    3. The sorted adapter just degrades to insertion sort of changes right here: <https://docs.rs/eyeball-im-util/0.10.0/src/eyeball_im_util/v...> and decomposes what could have been an atomic operation (append) into several inserts. Even `Set` does a linear scan and then becomes a (non-atomic again) remove and an insert, because it can change the sort order.

    4. The `.sort_by(new_sorter_lexicographic(vec![Box(...), Box(...), Box(...)]))` means that it's doing up to three dynamic dispatches on each comparison. The `new_sorter_lexicographic` is trivial, so inline those instead. And definitely don't take a separate lock on each, yuck, although see above anyway about how you just shouldn't have locks within the vec you're sorting.

    I would never use these abstractions.

    • By scottlamb 2026-03-1217:54

      5. In their "dessert" section, they talk about a problem with sort when the items are shallow clones. It's an example of a broader problem: they put something into an `ObservableVector` but then semantically mutate it via inner mutability (defeating the "observable"). You just can't do that. The sort infinite loop is the tip of the iceberg. Everything relying on the observable aspect is then wrong. The lesson isn't just "jumping on an optimization can lead to a bug"; it's also that abstractions have contracts.

HackerNews