Ok, the title is tongue-in-cheek, but there's very little thought put into files in most languages. File handling always feels a bit out of place... except in C. In fact, what you get is usually a worse version of C.
In C, files can be accessed in the same way as memory:
#include <sys/mman.h>
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // Create/open a file containing 1000 unsigned integers,
    // initialized to all zeros.
    int len = 1000 * sizeof(uint32_t);
    int file = open("numbers.u32", O_RDWR | O_CREAT, 0600);
    ftruncate(file, len);

    // Map it into memory.
    uint32_t* numbers = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED, file, 0);

    // Do something:
    printf("%u\n", numbers[42]);
    numbers[42] = numbers[42] + 1;

    // Clean up.
    munmap(numbers, len);
    close(file);
    return 0;
}
Memory mapping isn't the same as loading a file into memory: It still works if the file doesn't fit in RAM. Data is loaded as needed, so it won't take all day to open a terabyte file.
It works with all datatypes and is automatically cached. This cache is cleared if the system needs memory for something else.
mmap() is actually an OS feature, so many other languages have it. However, it's almost always limited to byte arrays: you have to grab a chunk of data, parse it, process it, and finally serialize it before writing it back to disk. It's nicer than manually calling read() and write(), but not by much.
These languages have all these nice features for manipulating data in memory, but nothing for manipulating data on disk. In memory, you get dynamically sized strings and vectors, enumerated types, objects, etc, etc. On disk, you get... a bunch of bytes.
Considering that most already support custom allocators and the like, adding a better way to access files seems very doable, but no one's actually done it. It's very weird to me that C — a language known for being unergonomic — actually does this the best.
C's implementation isn't even very good: Memory mapping comes with some overhead (page faults, TLB flushes) and C does nothing to handle endianness or errors... but it doesn't take much to beat nothing.
Sure, you might want to do some parsing and validation, but it shouldn't be required every time data leaves the disk. RAM is much smaller than the disk, so it's often impossible to just parse everything into memory. Besides, a lot of files aren't untrusted data in the first place.
In the case of binary files, parsing is usually redundant. There's no reason code can't directly manipulate the on-disk representation, or, for "scratchpad" temporary files, simply save the data as it exists in RAM. Sure, you wouldn't want to directly manipulate JSON, but there's no reason to do a bunch of work to save some integers.
File manipulation is similarly neglected. The filesystem is the original NoSQL database, but you seldom get more than a wrapper around C's readdir().
This usually results in people running another database, such as SQLite, on top of the filesystem, but relational databases never quite fit your program.
... and SQL integrates even worse than files: On top of having to serialize all your data, you have to write code in a whole separate language just to access it!
Most programmers will use it as a key-value store and implement their own indexing, creating a bizarre triple-nested database.
