Zeroperl: Sandboxing Perl with WebAssembly

2025-02-11 20:11 6834 andrews.substack.com

I’m building a new startup and file metadata plays an important role. There are thousands of file formats, each format may have dozens of versions, and each stores metadata differently.

While I would love to one day invest in creating a library to handle this monumental task, I think many would agree the best tool for this job is ExifTool by Phil Harvey.

Problem solved - throw it in a Docker container, run it when a file is uploaded, and call it a day? Not quite.

For some file formats, certain metadata can only be read by OS tooling. On macOS, for example, mdls is capable of reading vector embedding information from the QuickTime container which is useful for RAG applications. Our use-case also needs metadata to be present when a file is uploaded - extracting the data on our servers means we add considerable overhead to upload post-processing & we lose data that is useful to customers.

So we need to extract metadata client-side and staple it to the upload. Herein begins a journey of self-inflicted pain and suffering.

ExifTool is written in Perl.

No matter what programming language you choose to adopt for your new ideas, if you plan to grow and scale, there is a universal certainty that at some point your stack will come to depend on one piece of tooling written long ago in Python or Perl.

Python is objectively the most popular language, and as such a lot more community effort has been made to make embedding Python really easy. Perl hasn't received as much love.

To give Perl credit, they do have first-party documentation on how to embed Perl in a C program. However, this is through dynamic linking, so Perl still needs to be installed on the user’s system. On macOS Perl is installed by default, but we have no guarantee that the dependencies we need are installed, or that the prefix has been modified in a compatible way. Ultimately, we shouldn’t be modifying the user’s system, especially since our SDK or CLI could be run in an environment where changes aren’t persisted, so we’d have to bootstrap each time.

On Windows, we would need to ship the entirety of Perl (which is what ExifTool does via PAR::Packer), sign every binary, and pray no dynamic dependencies are missing.

Other options such as perlcc just don’t work. It is often heard that only Perl can parse Perl. This is not true. Perl cannot be parsed at all, it can only be executed. Perl has various built-in ambiguities that can only be resolved at runtime, so writing a self-contained interpreter is also nothing more than a fever dream.

But we’re already tugging on this thread, so let’s see where it leads. If we do ship our own build of Perl, that build needs to be statically linked. So how hard could that be?

Very.

There is essentially no documentation on how to achieve this. All we have are scattered forum posts from venerable ghosts. If they could do it, so could I.

Perl’s build system leans on a script called Configure, which is essentially a massive (and I do mean massive) shell script that probes your system and figures out how it should compile Perl. We will get more into that later.

Unfortunately, I use an M-series Mac, and no matter what I tried, Perl just wouldn’t build—not even dynamically. I’m sure it’s user error on my part (perhaps I failed to properly appease the camel). Regardless, local compilation on macOS wasn’t in the cards. Even with Rosetta and a Linux VM, everything was just broken enough to halt progress.

Picture the rest of this saga as a game of “golf” with a GitHub Action—endless commits, pushes, and guesswork—because that’s exactly how it played out.

After many hours, this was the magical incantation that finally produced a static build of Perl on Linux:

Let’s just look at the artif-

static perl compiled to 419.8 MB

There’s little point in explaining why this path turned out to be a dead end. Suffice it to say, I nearly gave up. Then I recalled the experiment where I compiled .NET to WebAssembly, which produced really small modules. Granted, the tooling there actually existed, but it gave me enough hope to think, “Maybe I can pull off something similar here.”

I’m hardly the first to toy with compiling Perl to WebAssembly. Projects like Perl.js forked the now-deprecated microperl, and WebPerl used an older version of mainline Perl. Both involved extensive patching to make Emscripten cooperate — something I definitely don’t want to repeat. They were products of their time, when Emscripten was still maturing and heavy modification was unavoidable.

And the work done to get a static build was not entirely a waste, because only Perl can compile Perl, so we will utilize that in our workflow to build our WebAssembly version.

Remember that giant Configure script we talked about? It doesn’t just sniff around your system — it can be coaxed into building Perl in just about any configuration under the sun.

All you need is a deep reservoir of patience and the willingness to juggle endless environment variables, flags, and a hint file.

My hunch was that Emscripten’s NODERAWFS feature would theoretically let Perl’s file-system calls just work—no heavy patching needed. If I could build a WebAssembly version of Perl that still saw the world as a regular filesystem, then tools like ExifTool might just function out of the box. Astoundingly, after enough trial and error, I ended up with a working Perl build in WebAssembly form. Even more surprising? ExifTool ran! Two major takeaways:

  1. It is possible to compile Perl to WebAssembly without patching the source.

  2. ExifTool can function in that environment, albeit with the Emscripten-generated JavaScript glue.

But (of course there’s a “but”): NODERAWFS isn’t truly raw filesystem access. The underlying data structures assume V8’s internal layout, so if you were hoping to simply drop this build into some other engine or reimplement the I/O layer, you’re in for a world of hurt. Replacing Emscripten’s JavaScript glue turned out to be unrealistic.

Still, this is a major win. I proved to myself that you can, in principle, compile Perl to WebAssembly and run ExifTool. But if I want a more engine-agnostic build that doesn’t rely so heavily on Emscripten’s JavaScript glue, the next logical step is WASI.

Given we now have a Configure hint file that produces a working Emscripten build, I should be able to replace emcc with the wasi-sdk without any iss-

There were in fact too many issues to count

Emscripten isn’t magic just because it makes native code run in the browser — it’s magic because it does so while shielding you from the dark horrors of a 37-year-old codebase. I’m not just talking about labyrinthine lines of C; I mean the unyielding force that is Perl’s Configure script, which insists on compiling and running little test executables—even when you explicitly tell it, “We’re cross-compiling here, please don’t do that!”

The workaround was to reproduce the Python scripts Emscripten uses to sidestep those tests, because apparently that’s how life works.

At this point, my workflow was simple (and maddening):

1. Tweak the hint file.

2. Lob new arguments at clang.

3. Wait for the build to fail on CI.

4. Repeat.

All the while, I’m chanting, “I do not want to patch Perl. I will not patch Perl.” I had to patch Perl. Larry Wall, the inventor of the patch tool, also created Perl, so in a way it’s poetic.

The first patch was relatively harmless: fix a bug in Configure, which ironically was itself causing build failures. But the real trouble started with setjmp/longjmp. Traditionally, these are used for exceptions, but Perl also uses them to run scripts. So, to compile successfully, we need -lsetjmp plus the LLVM flag -mllvm -wasm-enable-sjlj.
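For readers who haven’t met this pair of functions, here is a minimal sketch of the pattern in plain C. It is not taken from Perl’s source, and every name in it (run_env, run_script, deep_failure, run_protected) is made up for illustration. The point is only that setjmp() marks a spot on the stack and longjmp() jumps back to it from arbitrarily deep in the call chain, which is why the compiler and runtime both need some form of stack-unwinding support.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical names throughout; this is an illustration, not Perl's code. */
static jmp_buf run_env;   /* the saved "return here" point */
static int exit_code;     /* status handed back across the jump */

/* Stand-in for interpreter code that bails out from deep in the call stack. */
static void deep_failure(int code) {
    assert(code != 0);    /* 0 is reserved for "ran to completion" */
    exit_code = code;
    longjmp(run_env, 1);  /* unwinds to setjmp(); never returns */
}

/* Stand-in for "run the script". */
static void run_script(int should_fail) {
    if (should_fail)
        deep_failure(2);
}

/* Returns 0 after a normal run, or the code given to deep_failure(). */
int run_protected(int should_fail) {
    if (setjmp(run_env) != 0)
        return exit_code;         /* we got here through longjmp() */
    run_script(should_fail);      /* first pass: setjmp() returned 0 */
    return 0;
}
```

Calling run_protected(1) returns 2 without run_script() ever returning normally. On a WebAssembly target, that jump is what the -mllvm -wasm-enable-sjlj flag and -lsetjmp have to provide.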

Here’s the rub: no major WASM runtime supports the latest exception-handling proposal. Browsers partially support the phase-3 version, but neither Wasmtime nor Wasmer implements the `exnref` side of things. Meanwhile, WASI-SDK’s documentation suggests that if your build references these libraries, your only choice is to run:

wasm-opt --translate-to-exnref -all -o your_app.exnref.wasm your_app.wasm

…and then you’re stuck with exactly one tool that can run it: toywasm

Thankfully, a bit of luck arrived in the form of WACS, a C#-based WebAssembly runtime published by Kelvin Nishikawa. It didn’t initially support the current exception-handling proposal, but after I opened an issue, he added support in just two days. Once I had contributed some fixes to the WASIP1 filesystem implementation in WACS, I was unblocked.

Then came my first Perl bug: a bizarre preprocessor glitch. I still haven’t figured out the real fix, but I worked around it by setting the environment variable LC_ALL=C. Next up were integer overflows in file system calls. There is no elegant solution so far, but this quick patch resolves them.

And then, against all odds, it finally worked.

a .NET CLI program showing the output of the WASI 'perl -V'

After applying wasm-opt, the build is 6.9MB. We’re not there yet though.

The above build needs a host path to our “prefix” directory passed in as an argument, which we grant through WASI preopens. In theory, that’s fine: you drop the prefix files on disk, configure WASI, and Perl can open them. In practice, I noticed painful slowdowns. Perl scans a bunch of directories that don’t exist (it thinks it’s still on our CI machine with absolute paths), and every file read has to cross the host boundary. Also, what happens if those files get deleted?

So I figured, “Why not bundle the prefix inside the module and keep everything in guest memory?” That way, Perl can still open the modules it needs, but we don’t rely on the host filesystem for them—and for any other files, we fall back to WASI’s real I/O.

At first, I found wasi-vfs. It does exactly what I need. Unfortunately, it’s just a wrapper around wizer, which uses wasmtime, which doesn’t support the new exception-handling proposal. Crashed right out of the gate.

Then I saw a mention of using WASI Preview2 with something called WASI-Virt. I gave it a go, but it’s basically a haunted forest: no docs, half-implemented specs, Rust-only examples, and a WASI-SDK that just coughed up broken binaries whenever I tried enabling preview2 and working with ‘worlds’. That got old fast.

So I just rolled my own simple file system:

  1. A script that slurps all the files in the prefix directory, concatenates them into a single binary blob, and spits out a C header + source file enumerating each file’s offsets in that blob.

  2. I compile and link with flags like -Wl,--wrap=open, -Wl,--wrap=read, and so on. That means whenever Perl tries to call the real open(), it actually calls our function __wrap_open() first. We check if the requested path is in the “builtin” prefix. If it is, we serve the file directly from memory (via fmemopen()); if not, we pass the call along to the original function (like __real_open).

  3. A mini VFS table that keeps track of file descriptors, in-memory file sizes, and FILE* pointers. The code sets aside a little array (sfs_table), which we populate when we “open” a builtin file. The actual data for that file comes from the blob we generated in step 1, so we just carve out the correct slice of bytes and treat it like a read-only buffer. If Perl tries to read() from that file descriptor, it calls our __wrap_read(), which checks the table to see if it’s one of our in-memory handles—if yes, it just does an fread() from the buffer. Otherwise, it calls __real_read().

  4. We do the same trick with stat(), fstat(), lseek(), etc., always checking if the file is “ours” first and falling back if it isn’t. That keeps Perl’s scripts that rely on file checks from freaking out.
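To make the steps above concrete, here is a stripped-down sketch of the open/read path. It is an approximation under stated assumptions, not the zeroperl source. The /zeroperl prefix, the two embedded files, the sfs_* function names, and SFS_FD_BASE are all invented for illustration. It tracks a read offset by hand instead of going through fmemopen(), and it uses ordinary function names so it compiles without linker flags. In a --wrap build these functions would be __wrap_open, __wrap_read, and __wrap_close, and the fallbacks would call __real_open and friends.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Step 1 stand-in: a generator script would emit this blob and table. */
static const unsigned char sfs_blob[] = "package Foo;\n1;\n" "use strict;\n";

typedef struct { const char *path; size_t offset; size_t size; } sfs_entry;

static const sfs_entry sfs_files[] = {
    { "/zeroperl/lib/Foo.pm",    0,  16 },  /* hypothetical paths */
    { "/zeroperl/lib/strict.pm", 16, 12 },
};

/* Step 3 stand-in: one slot per open in-memory file. */
#define SFS_FD_BASE  100   /* assumed not to collide with host descriptors */
#define SFS_MAX_OPEN 8

typedef struct { const sfs_entry *file; size_t pos; } sfs_handle;
static sfs_handle sfs_table[SFS_MAX_OPEN];

static const sfs_entry *sfs_lookup(const char *path) {
    for (size_t i = 0; i < sizeof sfs_files / sizeof sfs_files[0]; i++)
        if (strcmp(sfs_files[i].path, path) == 0)
            return &sfs_files[i];
    return NULL;
}

/* Would be __wrap_open in a --wrap build (mode argument ignored here). */
int sfs_open(const char *path, int flags) {
    const sfs_entry *file = sfs_lookup(path);
    if (file == NULL)
        return open(path, flags);        /* not ours: use real I/O */
    for (int i = 0; i < SFS_MAX_OPEN; i++) {
        if (sfs_table[i].file == NULL) {
            sfs_table[i].file = file;
            sfs_table[i].pos = 0;
            return SFS_FD_BASE + i;
        }
    }
    errno = EMFILE;                      /* every in-memory slot is taken */
    return -1;
}

/* Would be __wrap_read in a --wrap build. */
ssize_t sfs_read(int fd, void *buf, size_t count) {
    int slot = fd - SFS_FD_BASE;
    if (slot < 0 || slot >= SFS_MAX_OPEN || sfs_table[slot].file == NULL)
        return read(fd, buf, count);     /* not ours: use real I/O */
    sfs_handle *h = &sfs_table[slot];
    assert(h->pos <= h->file->size);
    size_t left = h->file->size - h->pos;
    if (count > left)
        count = left;                    /* short read at end of file */
    memcpy(buf, sfs_blob + h->file->offset + h->pos, count);
    h->pos += count;
    return (ssize_t)count;
}

/* Would be __wrap_close in a --wrap build. */
int sfs_close(int fd) {
    int slot = fd - SFS_FD_BASE;
    if (slot < 0 || slot >= SFS_MAX_OPEN || sfs_table[slot].file == NULL)
        return close(fd);
    sfs_table[slot].file = NULL;
    return 0;
}
```

Opening /zeroperl/lib/strict.pm through sfs_open() never touches the host, while any path outside the table falls straight through to the real open(). Wrappers for stat(), fstat(), and lseek() would follow the same "is it ours?" check. The fixed descriptor base is the weakest part of this sketch: a real implementation has to pick descriptor numbers that cannot collide with ones handed out by the host.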

Once all that’s in place, Perl’s clueless that it’s not dealing with a “real” filesystem. I did find some weird bug where Perl panics if a file descriptor is > 16, so our simple file system shares the possible FD space with ones from the WASI host, but otherwise it works and the prefix is entirely bundled.

But, with every solution comes another problem: the build is now 50 MB.

To reduce the size of the build I’d remove a file from the prefix, create a new build, see if anything exploded, rinse and repeat. That might not sound too bad, but remember, Perl has a lot of files (docs, network modules we’ll never use), and you can’t assume how Perl interacts with Perl. If the build still worked, I’d record that filename in a delete.txt for the CI pipeline to clean up automatically.

After purging everything I could, each .pm and .pl file still contained large amounts of whitespace and inline docs. I used Perl::Strip to fix that. I ran it across what remained in the prefix, and although the CI time jumped from ten minutes to nearly an hour, I ended up with a 9.1 MB WebAssembly build of Perl that is fully sandboxed and self-contained.

I achieved what I set out to do. But (and this is the last one), because there is no broad support for the new exception proposal in WASI/WebAssembly runtimes, zeroperl is useless to you.

At least until, if ever, wasmtime or wasmer actually support the proposal. But in the spirit of open source, you can find the full code here. I’m gonna go put ExifTool in Docker now.

Update: every WebAssembly runtime is now supported. More details here.

Credits:

Vadim Kantorov: for providing a great starting point with Emscripten and for trading notes with me over email.

Kelvin Nishikawa: for the amazing work on WACS and for allowing me to collaborate on it.

Leon Timmermans: for attempting to diagnose the integer overflow cause

Karl Williamson: for the Configure patch

Follow me on: X / BlueSky


Comments

  • By dang 2025-02-11 22:43

    Since this is Part 1, we merged the comments from the Part 2 thread hither:

    Get in loser. We're rewinding the stack - https://news.ycombinator.com/item?id=43014070

    Readers may want to look at both articles of course!

  • By ncruces 2025-02-11 23:35

    Oh I'm so interested in this.

    I've wanted to use wazero to run my Exiftool [1] for quite a while. Just as I use wazero to sandbox dcraw [2].

    But WASI Perl never materialized.

    This may just be what I'm missing.

    [1]: https://github.com/ncruces/go-exiftool

    [2]: https://pkg.go.dev/github.com/ncruces/rethinkraw@v0.10.7/pkg...

  • By adolph 2025-02-11 17:49 (1 reply)

    Subhead is "Sandboxing Perl with WebAssembly - Part 2."

    The subhead sounds weird, but part 1 makes more sense and is pretty interesting. Perl has many modules to deal with file formats nobody has used since Perl's prime. It isn't totally clear to me if the goal is to compile the Perl interpreter into WASM or interpreter + modules. In either case the goal is to re-use the original tools within new tooling.

    I’m building a new startup and file metadata plays an important role. There are thousands of file formats, each format may have dozens of versions, and each stores metadata differently.

    Our use-case also needs metadata to be present when a file is uploaded - extracting the data on our servers means we add considerable overhead to upload post-processing & we lose data that is useful to customers.

    So we need to extract metadata client-side and staple it to the upload. Herein begins a journey of self-inflicted pain and suffering.

    ExifTool is written in Perl.

    https://andrews.substack.com/p/zeroperl-sandboxed-perl-with-...

    • By jasonthorsness 2025-02-11 18:11 (1 reply)

      Is there anything else in the same class as ExifTool - super valuable but the only implementation is Perl?

      • By tyingq 2025-02-12 1:23

        Not sure if you would say these are still the only implementations, but autoconf, Bugzilla and SpamAssassin were all at least once thought of that way.
