Recursion kills: The story behind CVE-2024-8176 in libexpat

2025-03-13 22:05 · blog.hartwork.org

For readers new to Expat:

libexpat is a fast streaming XML parser. Alongside libxml2, Expat is one of the most widely used software libre XML parsers written in C, specifically C99. It is cross-platform and licensed under the MIT license.

Expat 2.7.0 was released earlier today. I will make this a more detailed post than usual because in many ways there is more to tell about this release than about the average libexpat release: there is a story this time.

What is in release 2.7.0?

The key motivation for cutting a release now is to get the fix to a long-standing vulnerability out to users: I will get to that vulnerability — CVE-2024-8176 — in detail in a moment. First, what else is in this release?

There are also fixes to the two official build systems as usual, as well as improvements to the documentation.

There is a new fuzzer xml_lpm_fuzzer by Mark Brand that OSS-Fuzz has already started to include in their daily continuous fuzzing; the fuzzer is based on Clang's libFuzzer and Google's libprotobuf-mutator (LPM), which applies a variant of coverage-guided fuzzing called structured fuzzing. A side task of integrating that new fuzzer was making the dependency libprotobuf-mutator support the versions of Protobuf that are shipped by Ubuntu 24.04, 22.04 and 20.04: my related work upstream is available to everyone.

Another interesting sideshow of this release is the (harmless) TOCTTOU issue that was uncovered by static analysis in a benchmarking helper tool shipped next to core libexpat. If you have not heard of that class of race condition vulnerability but are curious, the related pull request could be of interest: it is textbook TOCTTOU in a real-world example.

One other thing that is new in this release is that Windows binaries are now built by GitHub Actions rather than AppVeyor, and not just 32bit but also 64bit. I had already added 64bit binaries to the previous release, Expat 2.6.4, post-release on January 21st, but only now is this becoming a regular part of the release process.

The vulnerability report

So what is that long-standing vulnerability about? In July 2022 — roughly two and a half years ago — Jann Horn of Google Project Zero and Spectre/Meltdown fame reached out to me via e-mail with a finding in libexpat, including an idea for a fix.

What he found can be thought of as "the linear version of billion laughs" — a linear chain (of so-called general entities) rather than a tree — like this:

<!DOCTYPE doc [
 <!ENTITY g0 ''>
 <!ENTITY g1 '&g0;'>
 <!ENTITY g2 '&g1;'>
]>
<doc>&g2;</doc>

Except not with two (or three) levels, but thousands. Why would a chain of thousands of entity references be a problem for libexpat? Because of recursion, i.e. recursive C function calls: each function call grows the stack, and if functions call each other recursively and attacker-controlled input can influence the number of recursive calls, then with the right input attackers can force the stack to overflow into the heap: stack overflow, segmentation fault, denial of service. How many levels of nesting that takes depends on the stack size of the target machine: 23,000 levels of nesting may be enough to hit one machine but not another.

The education that introduces or leads people towards recursion should come with a warning: recursion is not just beautiful, a thinking tool, and a way to often simpler solutions; it also has a dark side, a big inherent security problem. The article The Power of 10: Rules for Developing Safety-Critical Code warned about the use of recursion back in 2006, but Expat development had already started in 1997.

Already in that initial e-mail, Jann shared what he considered the fix, avoiding (or resolving) recursion, and attached a proof-of-concept patch showing how that could be done in general. Unlike other Project Zero findings, there would be no 90-days deadline for this issue because, while stack clashing was considered and is a theoretical possibility, denial of service was considered to be the realistic impact. It should be noted that this risk assessment comes without any guarantees.
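The general idea of such a fix, trading recursive C calls for state held in a loop or on the heap, can be sketched on a toy version of the problem; this illustrates the technique only and is not Expat's actual code:

```c
/* Toy model: entity i expands to entity next[i], or to nothing if
 * next[i] == -1. */

/* The vulnerable shape: one C stack frame per chain link, with the
 * depth controlled by the attacker. */
int chain_length_recursive(const int *next, int i) {
    if (next[i] == -1)
        return 1;
    return 1 + chain_length_recursive(next, next[i]);
}

/* The fix pattern: hold the traversal state in plain variables so
 * the C stack stays flat no matter how long the chain is. */
int chain_length_iterative(const int *next, int i) {
    int length = 1;
    while (next[i] != -1) {
        i = next[i];
        length++;
    }
    return length;
}
```

For a linear chain a plain loop suffices; for tree-shaped input (as in billion laughs) the same trick uses a growable, heap-allocated stack of pending work instead of a single loop variable.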

The vulnerability process

Two things became apparent to me:

  1. It seemed likely that this vulnerability had multiple "faces" or variants, and that the only true fix would indeed be to effectively remove all remaining recursion from Expat. It was not the first time that recursion had been an issue in C software, or even in libexpat in particular: Samanta Navarro had already resolved vulnerable recursion in a different place in libexpat code in February 2022. Thanks again!
  2. It would be a pile of work: not a good match for my unpaid volunteer role in Expat on top of my unrelated-to-Expat day job, and not without risk in the absence of a partner engaged at the level of detail the topic requires. My prior work on fixing billion laughs for Expat 2.4.0 made me expect this to be similar, but bigger.

And with that expectation, the issue started aging without a fix; in some sense, I felt paralyzed by the topic and kept procrastinating on it for a long time. Every now and then the topic came up with my friend, journalist and security researcher Hanno Böck, with whom I had shared the issue. He argued that even without a fix, the issue should be made public at some point.

One reason why I objected to publication without a fix was that it was clear that, in the absence of a cheap clean fix, vendors and distributions would start applying quick hacks that would produce false positives (i.e. reject well-formed benign XML misclassified as an attack), leave half of the issue unfixed, and leave the ecosystem in a potentially heterogeneous state of downstream patches where, say, a file would be rejected in openSUSE but parse fine in Debian, or the other way around: a great mess.

I eventually concluded that the vulnerability could not keep sitting in my inbox unfixed for another year, that it needed a fix before publication to not cause a mess, and that I had to take action.

Reaching out to companies for help

In early 2024, I started looking more actively for ways of finding help, and added a call-for-help banner to the change log that was included with Expat 2.6.2. I started drafting an e-mail that I would send out to companies known to use libexpat in hardware. A (by no means complete) public list of companies using Expat in hardware that I had started maintaining earlier now came in handy.

On April 14th, 2024, I started looking for security contacts for the companies on that list. For some, contacts were easy to find; for others, I eventually gave up; for some, I am still not sure whether I got the right address or whether they are ghosting me as part of an ostrich policy. I wish more companies would start serving /.well-known/security.txt; finding vulnerability report contacts is still actual work in 2025, and it should not be.
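For reference, the security.txt format is defined by RFC 9116; a minimal file served at /.well-known/security.txt needs little more than a contact and an expiry date (the values below are placeholders):

```text
Contact: mailto:security@example.org
Expires: 2026-01-01T00:00:00.000Z
```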

I then mailed circa 40 companies using a template, like this:

Hello ${company},


this e-mail is about ${company} product IT security.
Are you the right contact for that?  If not please forward
it to the responsible contact within ${company} — thank you!

On the security matter:

It has come to my attention that ${company} products and
business rely on libexpat or the "Expat" XML parser library,
e.g. product ${product} is using libexpat according to
document [1].  I am contacting you as the maintainer of
libexpat and its most active contributor for the last 8
years, as can be seen at [2]; I am reaching out to you today
to raise awareness that:

- All but the latest release of libexpat (2.6.2) have
  security issues known to the public, so every product
  using older versions of libexpat can be attacked through
  vulnerable versions of libexpat.

- Both automated fuzzing [3] and reports from security
  researchers keep uncovering vulnerabilities in libexpat,
  so it needs a process of updating the copy of libexpat
  that you bundle and ship with your products, if not
  already present.

- My time on libexpat is unfunded and limited, and there is
  no one but me to constantly work on libexpat security and
  to also progress on bigger lower priority tasks in
  libexpat.

- There is a non-public complex-to-fix security issue in
  libexpat that I have not been able to fix alone in my
  spare time for months now, that some attackers may have
  managed to find themselves and be actively exploiting
  today.

I need partners in fixing that vulnerability.
Can ${company} be a partner in fixing that vulnerability,
so that your products using libexpat will be secure to use
in the future?

I am looking forward to your reply, best



Sebastian Pipping

Maintainer of libexpat


[1] ${product_open_source_copyright_pdf_url}
[2] https://github.com/libexpat/libexpat/graphs/contributors
[3] https://en.wikipedia.org/wiki/Fuzzing

Replies are coming in

The responses I got from companies were all over the map:

  • My "favorite" reply was "We cannot understand what you want from us" when everyone else had understood me just fine. Nice!

  • Though that competes with the reply "A patch will be released after the fall." when they had not received any details from me. Okay!

  • There was arguing that the example product that I had mentioned was no longer receiving updates (rather than addressing their affected other products that are not end-of-life and continue to use libexpat).

  • I was asked to prove a concrete attack on the company's products (which would not scale, need access to the actual product, etc).

  • That they "do not have sufficient resources to assist you on this matter even if libexpat is employed in some of .....'s products" came back a few times.

It was interesting and fun in some sense, and not fun in another.

Next stop: confidentiality

What came next was that I asked companies to sign a simple freeform NDA with me. Companies were not prepared for that. Why was I asking for an NDA and TLP:RED? To (1) make sure that whoever got the details would need to collaborate on a true fix rather than just monkey-patch their own setups, and (2) avoid the scenario of heterogeneous downstream fixes mentioned before, which a leak before a true fix would likely have caused.

Some discussions failed at NDA stage already, while others survived and continued to video calls with me explaining Jann's findings in detail.

It is worth noting that I knew going in that many vulnerability reward programs exclude the whole class of denial of service, so I tied sharing the expected impact to signing an NDA, to reduce the chances of everyone discarding it with "Oh, 'just' denial of service, we'll pass".

The eventual team and security work

Simplifying a bit, I found two main partner companies in this: Siemens and a company that would not like to be named, let's call them "Unnamed Company". Siemens started development towards a candidate fix, and Unnamed Company started evaluating which other companies they could pay to help, which got Linutronix and also Red Hat involved.

Siemens took the builder role while Linutronix, Red Hat and I provided quality assurance of various kinds. While we did not work day and night, it is fair to say that we have been working on the issue since May 2024 — for about 10 months.

The three faces of the vulnerability

It did indeed turn out that the vulnerability has multiple — three — faces:

1. General entities in character data

<!DOCTYPE doc [
 <!ENTITY g0 ''>
 <!ENTITY g1 '&g0;'>
 <!ENTITY g2 '&g1;'>
]>
<doc>&g2;</doc>

2. General entities in attribute values

<!DOCTYPE doc [
 <!ENTITY g0 ''>
 <!ENTITY g1 '&g0;'>
 <!ENTITY g2 '&g1;'>
]>
<doc key='&g2;'/>

3. Parameter entities

<!DOCTYPE doc [
 <!ENTITY % p0 ''>
 <!ENTITY % p1 '&#37;p0;'>
 <!ENTITY % p2 '&#37;p1;'>
 <!ENTITY % define_g0 "<!ENTITY g0 '&#37;p2;'>">
 %define_g0;
]>
<doc/>

The third variant, "Parameter entities", reuses ideas from my exploit for vulnerability Parameter Laughs (CVE-2021-3541): it uses the same mechanism of delayed interpretation.

Conclusions and gratitude

It is no overstatement to say that without Berkay Eren Ürün — the main author of the fix — and his manager Dr. Thomas Pröll at Siemens there would be no fix today: a big and personal "thank you!" from me.

Thanks to Unnamed Company, to Linutronix, to Red Hat for your help making this plane fly!

Thanks to Jann Horn for his whitehat research and the demo patch that paved the way to a fix!

Thanks to everyone who contributed to this release of Expat!

And please tell your friends:

Please leave recursion to math and keep it out of (in particular C) software: it kills and will kill again.
Kind regards from libexpat, see CVE-2022-25313 and CVE-2024-8176 for proof.

For more details about this release, please check out the change log.

If you maintain Expat packaging or a bundled copy of Expat or a pinned version of Expat somewhere, please update to 2.7.0. Thank you!

Sebastian Pipping


Comments (from HackerNews)

  • By DonHopkins 2025-03-14 9:00 · 2 replies

    I hope the term "deyodawgification" enters the pantheon of elegant code purification terms of art, alongside the classics like "degotofying", "deCOMtamination", and "outparamdelling".

    https://knowyourmeme.com/memes/xzibit-yo-dawg

    https://wiki.mozilla.org/Gecko:DeCOMtamination_Algorithm

    https://bugzilla.mozilla.org/show_bug.cgi?id=455943

    https://news.ycombinator.com/item?id=22708241

    Who knows (or can invent) any other good ones:

    Deadlocksmithing, JSONcology, YAMLectomy, XMLimination, DeDTDification, DeXMLEntitification, ExCDATAration, AntiPOJOification, SOAPerfluousness Scrubbing, WSDLectomy, Delogorrhea, DeK8ification, HelmholtzianRechartification, DevOpsDevangelicalism, FactoryFactoryDefactoringDefactoring, Antiquasimonolithicmicrofragmentedmacrodecontainerizationarianism...

    • By immibis 2025-03-14 19:48

      "Soft lock picking" was the title of a YouTube series about strange ways out of impossible save files (soft locks, as opposed to hard locks where the game freezes) in video games.

    • By ggambetta 2025-03-14 9:17

      DTDetox

  • By preinheimer 2025-03-13 23:21 · 1 reply

    I think this was a really brave call for help from the writer. They needed help and they asked for it, from strangers!

    • By spyc 2025-03-13 23:54

      Thank you!

  • By kibwen 2025-03-14 12:56 · 11 replies

    I'm interested in examining the idea of a programming language that eschews the notion of a callstack and returns to having a single fixed activation record per function. This obviously places a lot of limits on language design, but in return you get a statically-guaranteed upper bound on memory usage, plus weirdo stuff like the ability to hand out references to local data in functions that have already returned (which remains valid as long as you don't call the function again, which I think should be possible to enforce via a borrow checker).

    • By openasocket 2025-03-14 15:56 · 2 replies

      There's prior art in this, very very old prior art. If you check out the first volume of The Art of Computer Programming, it uses a fictitious architecture called MIX. I understand that while it was fictitious, it was made to be similar enough to contemporary architectures. In it there are no special stack manipulation instructions, because there is no first-class notion of a stack! Functions used global variables for their scratch space. To call a function you would just jump to that address. Of course that function needed to know where to jump back to when complete. To do that, before jumping, the caller would WRITE THE RETURN ADDRESS TO THE JUMP INSTRUCTION AT THE END OF THE FUNCTION. This seems kind of insane in the modern day (function calls requiring self-modifying code!) but it meant you could implement functions without needing even a single extra word of storage space, and all you really gave up was recursion. I believe the original Fortran and some other older languages were originally implemented this way.

      There's definitely an advantage with this idea, but also some downsides. While you have a guaranteed upper bound on memory usage, it's completely static and actually going to be worse than doing the equivalent on the stack. Suppose you have 100 functions that each need 100 bytes of scratch space. Statically allocating everything like you describe means you need 10KB of memory reserved. But suppose there is one main function, and the rest are helper functions, and the main function calls each helper function one at time and the helper functions don't call anything. With the static approach, you still need 10KB. With the stack-based approach, you only need 200 bytes (the stack space for the main function, and the stack space for the helper function it is currently calling). Another advantage to the stack-based approach is due to caching. In the static approach, whether or not a function's scratch space is in the cache or not depends on how often that function is called. But in the stack-based approach, the top of the stack is almost always in the L1 cache, giving a speed boost to functions, even when they are not called frequently.

      That said, I do think it is an interesting idea that is worth exploring.

      • By NobodyNada 2025-03-14 18:03 · 1 reply

        > Suppose you have 100 functions that each need 100 bytes of scratch space. Statically allocating everything like you describe means you need 10KB of memory reserved. But suppose there is one main function, and the rest are helper functions, and the main function calls each helper function one at time and the helper functions don't call anything. With the static approach, you still need 10KB. With the stack-based approach, you only need 200 bytes (the stack space for the main function, and the stack space for the helper function it is currently calling).

        This is not necessarily the case: if you statically analyze the call graph, you may be able to prove that two functions will never be live at the same time, and thus reuse memory regions for their local variables: https://github.com/llvm-mos/llvm-mos/blob/main/llvm/lib/Targ...

        Of course, recursion, function pointers, and dynamic linking will defeat (or at least complicate) this scheme.

        • By openasocket 2025-03-14 19:14

          That's a good point, you can definitely optimize. I'd be curious how far you could go using static analysis. Assuming no recursion, function pointers, dynamic linking, etc; can you optimize to the point that you use as little memory as a stack approach would? I think you ALMOST can! The only blocker would be specific call patterns that would require a lot of memory, but are unreachable. That's where the halting problem hits.

          You just make a call graph where the nodes are functions and directed edges indicate which function calls which. Without recursion, function pointers and the like, you can compute this exactly, and it forms a DAG. Each node is given a weight, which is the amount of local memory that function needs. Start at the entry point function and compute the longest path (using the node weights).
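The longest-path computation described in the comment above is a standard DAG algorithm; a minimal sketch over a hypothetical adjacency matrix (all names and sizes here are made up for illustration):

```c
#define NFUNCS 4

/* weight[i]: bytes of locals that function i needs.
 * calls[i][j]: nonzero if function i calls function j.
 * Returns the worst-case stack usage starting at `entry`, assuming
 * the call graph is acyclic (no recursion in the analyzed program).
 * The analyzer itself recurses, but only as deep as the DAG; a
 * production version would memoize per node. */
static long worst_case_stack(const long weight[NFUNCS],
                             const int calls[NFUNCS][NFUNCS],
                             int entry) {
    long best = 0;
    for (int j = 0; j < NFUNCS; j++) {
        if (calls[entry][j]) {
            long sub = worst_case_stack(weight, calls, j);
            if (sub > best)
                best = sub;
        }
    }
    return weight[entry] + best;
}
```

With one main function of 100 bytes calling three 100-byte helpers that call nothing, this yields 200 bytes, matching the stack-based figure in the earlier comment, versus 400 bytes allocated fully statically.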

      • By fpoling 2025-03-14 18:10 · 1 reply

        Original Fortran did not modify the code. Rather, each function had a global variable storing the return address, which the caller had to set.

        • By pklausler 2025-03-14 18:36

          The CDC 6600 of blessed memory had a "return jump" instruction that wrote an unconditional jump back to the return address into the word prior to the first instruction of the called procedure. So one would implement a subroutine return by jumping to the word before the entry point.

          (Fortran programmers are probably wondering how alternate ENTRY statements worked; yes, they had to allocate a word before the ENTRY, and copy whatever jump instruction was placed there by the hardware into the main subroutine's jump word, so that RETURN or END would work for all entry points. Recursive languages like Pascal had to copy that word to and from the stack. Reentrant code had to avoid using the return jump instruction entirely.)

    • By cyco130 2025-03-14 19:13

      As others have pointed out, that's pretty much how things originally started. A more recent example I remember is the Action! programming language for Atari 8-bit computers[1], an Algol descendent for 6502 with static activation frames.

      But I don't understand the appeal. Many algorithms and data structures are most naturally expressed with recursion. Not allowing it forces the programmer to create and maintain their own manual stack data structures which brings us to square one in terms of statically-guaranteed upper bound on memory usage since they can grow indefinitely. At best, stack overflow errors will be replaced by array out of bounds errors which is the exact same thing anyway. In fact, it will be even worse: Manual stack implementation will come with its own complexity and source of bugs.

      [1] https://en.wikipedia.org/wiki/Action!_(programming_language)

    • By kazinator 2025-03-14 23:33

      > I'm interested in examining the idea of a programming language that eschews the notion of a callstack

      Chicken Scheme fits the bill. While it doesn't eschew the notion of a call stack, it turns it on its ear.

      In Chicken Scheme

      1. All objects are allocated on the stack. Strings, conses, symbols, lexical environments, ...

      2. Functions do not return. Everything is compiled to C which uses continuation passing style. What looks like returning in the Scheme source code is a continuation call: the parent function passes a continuation, and so the function return invokes that and runs that lambda.

      3. Because functions do not return and everything is allocated from the stack, the stack grows voraciously. When it hits a certain limit, a kind of ephemeral garbage collection takes place: all objects that have been allocated on the stack which are still live are transferred to the heap. Then ... the stack pointer is rewound.

      Essentially, the C stack has become a bump allocator in a kind of generational GC.

    • By adrian_b 2025-03-14 16:37 · 1 reply

      That programming language exists.

      The old FORTRAN and COBOL language versions were like this.

      In ancient FORTRAN programs, the programmer typically computed a maximum possible size for the data and allocated statically in the main program one or more work arrays corresponding to that maximum size, which were either placed in a COMMON block of global variables or they were passed as arguments to the invoked functions and subroutines.

      In the functions or subroutines, the work arrays were typically partitioned by allocating in them various local variables.

      Overall, this approach was more reliable, because there were no risks of exceeding the memory limits, but it was much more tedious for the programmer, because the worst cases for memory usage had to be accurately predicted.

      • By guenthert 2025-03-14 16:44

        > Overall, this approach was more reliable, because there were no risks of exceeding the memory limits, but it was much more tedious for the programmer, because the worst cases for memory usage had to be accurately predicted.

        Well, most importantly, it fixed the program's limits in place. Got a bigger machine and wanted to tackle greater data-sets? You're out of luck, if you didn't have the source code. Old Unix programs were often like that. It was the GNU counterparts which often did away with such arbitrary limits, allowing systems to grow and the software to stay relevant.

    • By amavect 2025-03-14 20:12

      >plus weirdo stuff like the ability to hand out references to local data in functions that have already returned (which remains valid as long as you don't call the function again, which I think should be possible to enforce via a borrow checker).

      The C programming language supports this with the static keyword. Further calls may overwrite the pointed data. I have played with allocating fixed-size data in static locals, but I always found that allocating in the caller does the same job better. For example, compare strerror() with strerror_s(). (A sensible strerror implementation should return a pointer to static immutable data, but the Standard doesn't specify it.)

      A procedural language can achieve a statically bounded call stack by restricting self recursion, mutual recursion, and function pointers. I struggle to imagine how a language without a call stack could perform differently or better.
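The static-local pattern the comment refers to, in miniature; a sketch with a hypothetical function name:

```c
#include <stdio.h>

/* One fixed activation record: the buffer lives in static storage,
 * so the returned pointer survives the return -- until the next
 * call overwrites it. Not thread-safe, by construction. */
const char *format_id(int id) {
    static char buf[32];
    snprintf(buf, sizeof(buf), "id-%d", id);
    return buf;
}
```

Every call returns the same address; the data is only valid until the next call, which is exactly the "valid as long as you don't call the function again" property discussed upthread.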

    • By Someone 2025-03-14 13:49 · 1 reply

      > I'm interested in examining the idea of a programming language that eschews the notion of a callstack

      Technically, I don’t know any language whose spec mentions the notion of a call stack. For example, it’s perfectly OK for a conforming C compiler to use a linked list of dynamically allocated activation records (from a standards view point; users of such an implementation may have different opinions)

      A conforming C compiler also could choose to only use a call stack for functions that take part in recursive calls or that may get called by multiple threads.

      > plus weirdo stuff like the ability to hand out references to local data in functions that have already returned (which remains valid as long as you don't call the function again, which I think should be possible to enforce via a borrow checker).

      If you add multi-threading (something that is almost a must have for languages on powerful CPUs nowadays), I don’t think that’s easy to do.

      • By layer8 2025-03-14 14:24 · 1 reply

        I don’t think that “call stack” implies “contiguous memory”, or “what the operating system might think of a process call stack”, so a linked list would still qualify as a call stack. While the C standard doesn’t use the word “stack”, it explicitly requires support for recursive function calls, and the related semantics it specifies with regard to automatic storage duration effectively describe a stack.

        • By Someone 2025-03-15 14:00

          If you want to ignore such implementation details, you’re basically saying you want to see a language that doesn’t allow recursion, or maybe even one that allows recursion, but only in such a way that the compiler can compute the maximum amount of memory needed for activation records.

          In a language that doesn’t allow recursion, using a call stack still can make sense for the implementation because it allows reuse of memory for local variables between functions that do not call each other, directly or indirectly.

          But then, you’re giving up “plus weirdo stuff like the ability to hand out references to local data in functions that have already returned”, although, thinking of it, static analysis could make those locals survive by manipulating the stack pointer (if you have the caller allocate and deallocate the locals of the called function, the caller can postpone that deallocation if it wants to access the locals of the called function). I’m not sure that weirdo stuff is worth the effort, though. It’s just a weird way to return multiple values from a function.

    • By fc417fc802 2025-03-14 13:54 · 1 reply

      > single fixed activation record per function

      What about statically determining a fixed number of activation records at compile time? Similar in spirit to security focused languages that require loops to have a statically determined finite bound in order to successfully compile.

      As to lifetime after returning, do you really hate continuations that much?

      • By kibwen 2025-03-14 15:12

        > What about statically determining a fixed number of activation records at compile time?

        Sure, it may be useful to represent a function's local storage as a first-class concept, and then allow users to instantiate copies of the function at will, if they're willing to allocate more storage themselves, thereby allowing users to either precisely limit the number of instances of a function or otherwise use dynamic allocation to manually reimplement a callstack if they so choose.

        > As to lifetime after returning, do you really hate continuations that much?

        This is a language that forbids recursion, the functional programmers have already run screaming for the exits. :P

    • By hakfoo 2025-03-15 6:53

      Does it require a new language, or could it be enforced by simple discipline or linting?

      I'd suspect it makes sense in an embedded bare-metal environment where you know all the data structures you might ever have in circulation at once. This sometimes ties into things like explicitly unchanging data as "this can be allocated to ROM" to tightly manage the limited RAM.

    • By cypherpunk666 2025-03-14 16:20

      related thoughts perhaps? http://lambda-the-ultimate.org/node/5555

      (be warned that the old site links are slow, one has to wait for the unarchiving to run, or something.)

    • By bsder 2025-03-14 18:43

      > I'm interested in examining the idea of a programming language that eschews the notion of a callstack and returns to having a single fixed activation record per function.

      Welcome to the old 8-bit C compilers.

      Everything was statically allocated so no recursion. The downside is that you can only have a single thread of execution.

      It would be interesting to see what that would look like if extended to multiple threads (they would have to be defined/allocated at compile time).
