Hacker News

Deep Down the Rabbit Hole: Bash, OverlayFS, and a 30-Year-Old Surprise

2025-06-25 13:43  6214  sigma-star.at

This blog post recounts a recent debugging session that uncovered a surprising set of issues involving Bash, getcwd(), and OverlayFS. What began as a simple customer bug report evolved into a deep dive worth sharing.

Initial Bug Report

A customer reported that OpenSSH scp failed after switching to OverlayFS. We found the following error in the logs:

shell-init: error retrieving current directory: \
getcwd: cannot access parent directories: Inappropriate ioctl for device

After analyzing the report, we realized the message didn’t come from scp itself but from the Bash shell. We asked the key question: why couldn’t Bash determine the current working directory, and why did it fail with ENOTTY (Inappropriate ioctl for device)?

Ruling Out the Kernel

Because the issue appeared after the introduction of OverlayFS, we reviewed the OverlayFS source code in the Linux kernel for any code paths that return ENOTTY. Although such paths exist, we considered hitting them highly unlikely.

Bash uses glibc and is written in C. We examined the glibc system call wrapper for getcwd() but found no logic that could return ENOTTY. The wrapper mainly handles buffer allocation and falls back to a generic implementation if the system call fails.

To test this theory, we enabled system call tracing. Surprisingly, the trace showed that the getcwd() system call was never made. Since glibc offers multiple getcwd() implementations depending on the system, we double-checked that we had reviewed the correct Linux-specific one. We found no code path that bypassed the system call.

Bash’s Homemade `getcwd()`

A hunch led us to check how Bash links to the getcwd symbol:

$ nm -D bash | grep getcwd
...
000c7b10 T getcwd
...

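This showed that Bash includes its own getcwd() function rather than relying on glibc’s version: the T marks a symbol defined in the binary’s own text section. Had Bash used glibc’s implementation, the symbol would have been listed as undefined (U), so we expected this output instead: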

$ nm -D bash | grep getcwd
...
         U getcwd
...

Surprised, we inspected the Bash source and confirmed it does contain a getcwd() implementation, but guarded by the following:

#if !defined (HAVE_GETCWD)

Developers originally intended this fallback for ancient Unix systems lacking the getcwd() system call. On Linux, HAVE_GETCWD should normally be defined.

We confirmed that config.h did define HAVE_GETCWD.

At first, this puzzled us: under normal conditions, the fallback implementation should never be compiled. But further inspection of config-bot.h showed this logic:

#if defined (HAVE_GETCWD) && defined (GETCWD_BROKEN) && !defined (SOLARIS)
# undef HAVE_GETCWD
#endif

Sure enough, our config.h defined GETCWD_BROKEN. That explained why Bash used its internal fallback. But why did the system consider getcwd() broken?

Cross-Compilation Confusion

We examined the output of the configure script in detail to trace the origin of GETCWD_BROKEN. We found this line:

checking if getcwd() will dynamically allocate memory with 0 size... \
configure: WARNING: cannot check whether getcwd allocates memory when cross-compiling \
-- defaulting to no

The check in aclocal.m4 sets GETCWD_BROKEN if it can’t confirm that getcwd() allocates memory with a zero-size buffer. Since the build occurred in a cross-compilation environment, the test defaulted to failure.
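
The property being probed can be sketched in a few lines of C. This is only an illustration of the idea, not the actual test program from aclocal.m4, and the helper name getcwd_allocates is hypothetical. A probe like this has to run on the target, which is exactly what a cross-compiling configure script cannot do.

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if getcwd(NULL, 0) hands back a
 * malloc'ed path, 0 otherwise. This mirrors the idea behind the
 * configure probe, not its exact code. */
int getcwd_allocates(void)
{
    /* With a NULL buffer and size 0, glibc allocates a buffer that is
     * large enough for the current path. */
    char *path = getcwd(NULL, 0);

    if (path == NULL)
        return 0;
    free(path);
    return 1;
}
```

On a glibc-based Linux system, calling getcwd_allocates() from a normal working directory is expected to report that the allocation works.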

In other words, Bash silently falls back to its own getcwd() whenever it is cross-compiled without an override for this check. Since Bash was cross-compiled for ARM in this specific setup, that made sense. We then wondered why the issue wasn’t more widespread. After all, both Bash and OverlayFS are common in embedded systems.

Next, we looked into how major embedded Linux projects like Yocto handle cross-compiling Bash. Although the Bash Yocto recipe didn’t mention getcwd, we found this line in meta/site/common-glibc:

bash_cv_getcwd_malloc=${bash_cv_getcwd_malloc=yes}

Yocto explicitly overrides the test result to avoid the fallback. The embedded Linux build system we used didn’t apply such a workaround. This clarified the issue. After we implemented a similar override, the issue vanished.

Root Cause Analysis

At this point, we had identified and fixed the bug. But several questions remained:

Why did the issue appear only with OverlayFS?
Why did Bash’s fallback getcwd() fail?

During testing, we observed another error message:

shell-init: error retrieving current directory: \
getcwd: cannot access parent directories: Success

This indicated that errno was sometimes set to 0, suggesting no error occurred, yet getcwd() still failed.

OverlayFS and Inode Numbers

To answer the remaining questions, we analyzed Bash’s getcwd() implementation. On Linux, you can determine the current working directory in two ways:

Use the getcwd() system call
Read the /proc/self/cwd symlink
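
As a minimal sketch, the two approaches listed above could look like this in C. The function names cwd_via_syscall and cwd_via_proc are made up for illustration, and the second one assumes that /proc is mounted.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX.1-2008 declarations */
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Fills buf using the getcwd() wrapper. Returns 0 on success, -1 on error. */
int cwd_via_syscall(char *buf, size_t size)
{
    return getcwd(buf, size) != NULL ? 0 : -1;
}

/* Fills buf by resolving the /proc/self/cwd symlink.
 * Returns 0 on success, -1 on error. */
int cwd_via_proc(char *buf, size_t size)
{
    ssize_t len;

    if (size == 0)
        return -1;
    /* readlink() does not NUL-terminate, so keep one byte in reserve.
     * A result of exactly size - 1 bytes may have been truncated. */
    len = readlink("/proc/self/cwd", buf, size - 1);
    if (len < 0)
        return -1;
    buf[len] = '\0';
    return 0;
}
```

In a directory that still exists, both functions should produce the same absolute path.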

Bash’s implementation used neither, aiming to support systems lacking these features. In fact, the fallback dates back to the last millennium. It used a classic Unix algorithm to reconstruct the working directory path:

It calls stat(".") to obtain the inode number of the current directory.
It calls readdir("..") to read the parent directory’s entries.
It compares inode numbers to identify "."’s name.

It repeats this process recursively to climb the full path.

Note that this simplified description omits many details. In practice, you must evaluate both inode (st_ino) and device (st_dev) to work across mount points.
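
One step of this classic algorithm can be sketched as follows. This is not Bash’s code: the function name name_of_dot is hypothetical, and as a deliberate departure the sketch compares the st_dev and st_ino values obtained from lstat() on each entry instead of trusting the inode number reported by readdir().

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX.1-2008 declarations */
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: looks up the name that ".." uses for ".".
 * Returns 0 and fills name on success, -1 if no entry matched or on error. */
int name_of_dot(char *name, size_t size)
{
    struct stat dot, entry;
    struct dirent *de;
    DIR *parent;
    char path[4096];
    int found = -1;

    if (stat(".", &dot) != 0)
        return -1;
    parent = opendir("..");
    if (parent == NULL)
        return -1;

    for (;;) {
        errno = 0;                      /* lets callers tell EOF from error */
        de = readdir(parent);
        if (de == NULL)
            break;
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        if (snprintf(path, sizeof path, "../%s", de->d_name) >= (int)sizeof path)
            continue;                   /* name too long for this sketch */
        if (lstat(path, &entry) != 0)
            continue;                   /* entry vanished or is unreadable */
        /* Compare device and inode so the match also holds at mount points. */
        if (entry.st_dev == dot.st_dev && entry.st_ino == dot.st_ino) {
            if (strlen(de->d_name) < size) {
                strcpy(name, de->d_name);
                found = 0;
            }
            break;
        }
    }
    closedir(parent);
    return found;
}
```

A full getcwd() replacement would repeat this step for each parent, prepending the names it finds, until it reaches the root directory. The extra lstat() per entry is the cost of not relying on readdir() inode numbers.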

Tracing revealed that the fallback getcwd() failed on the very first path component. stat(".") returned an inode number N, but readdir("..") returned no entry with inode number N.

OverlayFS merges two directories, a lower (read-only) and an upper (writable) layer. When calling readdir() on a directory, OverlayFS combines entries from both layers without performing full lookups. It returns the underlying inode numbers directly, unmodified.

This design means that inode numbers from readdir() don’t guarantee uniqueness or stability in the merged view. Two entries might even share an inode number without being hard links. OverlayFS uses this approach to provide fast directory listings, because performing a full lookup for each entry would incur performance penalties.

Conversely, stat() triggers a full lookup. OverlayFS allocates an inode object that provides stable and unique inode numbers. That stability is crucial for tools like find or du.

Bash’s fallback getcwd() assumes that the inode from stat() matches one returned by readdir(). OverlayFS breaks that assumption.

We eventually realized that OverlayFS documentation acknowledges this limitation: For directories, the inode number from readdir() may not match the number from stat().
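
To check whether a given directory is affected, one could compare the two inode sources directly. The following is a rough diagnostic sketch, not a tool we used during the investigation, and the function name count_ino_mismatches is hypothetical. Note that mount points inside the directory can also show up as mismatches on ordinary filesystems.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX.1-2008 declarations */
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: counts entries of dirpath whose readdir() inode
 * number (d_ino) differs from the st_ino reported by a full lookup.
 * Returns the number of mismatches, or -1 on error. */
long count_ino_mismatches(const char *dirpath)
{
    DIR *dir = opendir(dirpath);
    struct dirent *de;
    struct stat st;
    long mismatches = 0;
    int saved;

    if (dir == NULL)
        return -1;
    for (;;) {
        errno = 0;                      /* distinguish EOF from error */
        de = readdir(dir);
        if (de == NULL)
            break;
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        /* Look the entry up relative to the open directory, without
         * following symlinks. */
        if (fstatat(dirfd(dir), de->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0)
            continue;                   /* entry vanished or is unreadable */
        if ((ino_t)de->d_ino != st.st_ino)
            mismatches++;
    }
    saved = errno;                      /* 0 here means a clean EOF */
    closedir(dir);
    return saved != 0 ? -1 : mismatches;
}
```

A result of 0 suggests that an inode-matching algorithm would work in that directory. A positive result points to the kind of discrepancy described above.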

The Role of the xino Feature

OverlayFS can deliver stable inode numbers via readdir() when the xino feature is active. 64-bit systems can encode extra data (e.g., instance numbers) into inode fields to prevent collisions. This works without requiring a full lookup and does not hurt readdir() performance.

However, 32-bit systems lack this space, so the xino feature is not available. We encountered the original problem on a 32-bit ARM platform, which explained why the issue occurred there.

Incorrect Use of `readdir()` in Bash

One question remained: why did getcwd() sometimes fail with ENOTTY?

Upon inspecting Bash’s getcwd(), we noticed it misused readdir() slightly:

readdir() returns NULL both on EOF and on error.
To distinguish between an error condition and the end of the directory list, the caller must set errno to zero before calling readdir().
If readdir() returns NULL and errno == 0, it means EOF.
Bash forgot to reset errno before the call. For about 30 years, no one noticed.

As a result, when readdir() returned NULL with no match, Bash incorrectly assumed an error. It returned NULL and left errno in an undefined state. Sometimes, ENOTTY from a previous system call remained, producing misleading errors.
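
The correct pattern is short. Below is a minimal sketch in C that is not taken from Bash or from any patch. The function name count_entries is hypothetical. It clears errno before every readdir() call and inspects errno only after a NULL return.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX.1-2008 declarations */
#include <dirent.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper: counts the entries in path.
 * Returns the count on success, or -1 with errno set on error. */
long count_entries(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *de;
    long count = 0;
    int saved;

    if (dir == NULL)
        return -1;
    for (;;) {
        errno = 0;          /* clear stale values such as ENOTTY */
        de = readdir(dir);
        if (de == NULL)
            break;          /* either EOF or an error, decided below */
        count++;
    }
    saved = errno;          /* 0 means EOF, nonzero means a real error */
    closedir(dir);
    if (saved != 0) {
        errno = saved;
        return -1;
    }
    return count;
}
```

Because errno is reset inside the loop, a leftover value from an earlier call cannot turn a normal end of directory into a reported failure.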

We have reported the issue to the GNU Bash project. Once the bug report becomes publicly visible, it will be linked here.

Conclusion

This bug hunt revealed several contributing factors:

A misconfigured cross-compilation environment caused Bash to use its fallback getcwd().
OverlayFS introduced subtle inode behavior differences, especially on 32-bit systems.
Bash’s fallback getcwd() relied on assumptions that failed with OverlayFS.
A decades-old oversight in Bash’s error handling created misleading errno values.

While we resolved the issue with a simple build tweak, the investigation highlighted deeper lessons about portability assumptions, legacy code, and filesystem complexity.

Deeg9rie9usi

Comments

By chubot 2025-06-25 20:30

Wow great bug!

> Bash forgot to reset errno before the call. For about 30 years, no one noticed

I have to say, this part of the POSIX API is maddening!

99% of the time, you don't need to set errno = 0 before making a call. You check for a non-zero return, and only then look at errno.

But SOMETIMES you need to set errno = 0, because in this case readdir() returns NULL on both error and EOF.

I actually didn't realize this before working on https://oils.pub/

---

And it should go without saying: Oils simply uses libc - we don't need to support systems with a broken getcwd()!

Although a funny thing is that I just fixed a bug related to $PWD that AT&T ksh (the original shell, that bash is based on) hasn't fixed for 30+ years too!

(and I didn't realize it was still maintained)

https://www.illumos.org/issues/17442

https://github.com/oils-for-unix/oils/issues/2058

There is a subtle issue with respect to:

1) "trusting" the $PWD value you inherit from another process

2) Respecting symlinks - this is the reason the shell can't just call getcwd() !

    if (*p != '/' || stat(p, &st1) || stat(".", &st2) ||
        st1.st_dev != st2.st_dev || st1.st_ino != st2.st_ino)
        p = 0;

Basically, the shell considers BOTH the inherited $PWD and the value of getcwd() to determine its $PWD. It can't just use one or the other!

By JonChesterfield 2025-06-25 23:51, 1 reply

The response to the bug on the mailing list is disheartening. Report goes "set errno=0 so your error message makes sense". Didn't get a thanks, fixed.

Instead there's objections on the basis "filesystems shouldn't work like that".

By chubot 2025-06-26 1:23

There seem to be a bunch of toxic people on the bash mailing list, and I think many or all of them don't even contribute code

The person who responded dismissively later says "I'm just another user."

---

Every commit since they started using git in 2009 is attributed to one person:

https://cgit.git.savannah.gnu.org/cgit/bash.git/log/

I think occasionally contributed patches are applied, but this is not apparent in source control.

I was attacked on the bash mailing list several years ago, so I don't go there anymore :-)

By justincormack 2025-06-25 21:24, 1 reply

Most of the stuff that configure scripts check is obsolete, and breaks in situations like this as the checks are often not workable without running code. It is likely the check does not apply to any system that has existed for decades. Lots of systems have disabled it, e.g. Nix in 2017 [1]

[1] https://github.com/NixOS/nixpkgs/commit/dff0ba38a243603534c9...

By arp242 2025-06-25 22:37

I had a look at the bash source code a few years back, and there are tons of hacks and workarounds for 1980s-era systems. Looking at the git log, GETCWD_BROKEN was added in bash 1.14 from 1996, presumably to work around some system at the time (a system which was perhaps already old in 1996, but it's not detailed which).

Also, that getcwd.c which contains the getcwd() fallback and bug is in K&R C, which should be a hint at how well maintained all of this is. Bash takes "don't fix it if it ain't broke" to new levels, to the point of introducing breakage like here (the bash-malloc is also notorious for this – no idea why that's still enabled by default).