This blog post recounts a recent debugging session that uncovered a surprising set of issues involving Bash, getcwd(), and OverlayFS.
What began as a simple customer bug report evolved into a deep dive worth sharing.
A customer reported that OpenSSH scp failed after switching to OverlayFS. We found the following error in the logs:
shell-init: error retrieving current directory: \
getcwd: cannot access parent directories: Inappropriate ioctl for device
After analyzing the report, we realized the message didn’t come from scp itself but from the Bash shell.
We asked the key question: why couldn’t Bash determine the current working directory, and why did it fail with ENOTTY (Inappropriate ioctl for device)?
Because the issue appeared after the introduction of OverlayFS, we reviewed the OverlayFS source code in the Linux kernel for any code paths that return ENOTTY.
Although such paths exist, we considered hitting them highly unlikely.
Bash uses glibc and is written in C.
We examined the glibc system call wrapper for getcwd() but found no logic that could return ENOTTY.
The wrapper mainly handles buffer allocation and falls back to a generic implementation if the system call fails.
To test this theory, we enabled system call tracing.
Surprisingly, the trace revealed that the getcwd() system call never got called.
Since glibc offers multiple getcwd() implementations depending on the system, we double-checked that we had reviewed the correct Linux-specific one.
We found no code path that bypassed the system call.
A hunch led us to check how Bash links to the getcwd symbol:
$ nm -D bash | grep getcwd
...
000c7b10 T getcwd
...
This showed that Bash includes its own getcwd() function rather than relying on glibc’s version.
We expected this output instead:
$ nm -D bash | grep getcwd
...
U getcwd
...
Surprised, we inspected the Bash source and confirmed it does contain a getcwd() implementation, but guarded by the following:
#if !defined (HAVE_GETCWD)
Developers originally intended this fallback for ancient Unix systems lacking the getcwd() system call.
On Linux, HAVE_GETCWD should normally be defined.
We confirmed that our config.h defined it.
At first, this puzzled us: under normal conditions the implementation should never compile.
But further inspection of config-bot.h showed this logic:
#if defined (HAVE_GETCWD) && defined (GETCWD_BROKEN) && !defined (SOLARIS)
# undef HAVE_GETCWD
#endif
Sure enough, our config.h defined GETCWD_BROKEN.
That explained why Bash used its internal fallback.
But why did the system consider getcwd() broken?
We examined the output of the configure script in detail to trace the origin of GETCWD_BROKEN.
We found this line:
checking if getcwd() will dynamically allocate memory with 0 size... \
configure: WARNING: cannot check whether getcwd allocates memory when cross-compiling \
-- defaulting to no
The check in aclocal.m4 sets GETCWD_BROKEN if it can’t confirm that getcwd() allocates memory with a zero-size buffer.
Since the build occurred in a cross-compilation environment, the test defaulted to failure.
We discovered that Bash becomes problematic in cross-compilation environments. Since Bash is cross-compiled for ARM in this specific setup, this made sense. We then wondered why this issue wasn’t more widespread. After all, both Bash and OverlayFS are common in embedded systems.
Next, we looked into how major embedded Linux projects like Yocto handle cross-compiling Bash.
Although the Bash Yocto recipe didn’t mention getcwd, we found this line in meta/site/common-glibc:
bash_cv_getcwd_malloc=${bash_cv_getcwd_malloc=yes}
Yocto explicitly overrides the test result to avoid the fallback. The embedded Linux build system we used didn’t apply such a workaround. This clarified the issue. After we implemented a similar override, the issue vanished.
At this point, we had identified and fixed the bug. But several questions remained, above all: why did Bash’s getcwd() fail?
During testing, we observed another error message:
shell-init: error retrieving current directory: \
getcwd: cannot access parent directories: Success
This indicated that errno was sometimes set to 0, suggesting no error occurred, yet getcwd() still failed.
To answer the remaining questions, we analyzed Bash’s getcwd() implementation.
On Linux, you can determine the current working directory in two ways:
- the getcwd() system call
- the /proc/self/cwd symlink
Bash’s implementation used neither, aiming to support systems lacking these features. In fact, the fallback dates back to the last millennium. It used a classic Unix algorithm to reconstruct the working directory path:
1. Call stat(".") to obtain the inode number of the current directory.
2. Call readdir("..") to read the parent directory’s entries.
3. Find the entry with the matching inode number; its name is "."’s name.
It repeats this process recursively to climb the full path.
Note that this simplified description omits many details.
In practice, you must evaluate both inode (st_ino) and device (st_dev) to work across mount points.
Tracing revealed that the fallback getcwd() failed on the very first path component.
stat(".") returned an inode number N, but readdir("..") returned no matching directory with inode number N.
OverlayFS merges two directories, a lower (read-only) and an upper (writable) layer.
When calling readdir() on a directory, OverlayFS combines entries from both layers without performing full lookups.
It returns the underlying inode numbers directly, unmodified.
This design means that inode numbers from readdir() don’t guarantee uniqueness or stability in the merged view.
Two entries might even share an inode number without being hard links.
OverlayFS uses this approach to provide fast directory listings; performing a full lookup for each entry would incur performance penalties.
Conversely, stat() triggers a full lookup.
OverlayFS allocates an inode object that provides stable and unique inode numbers.
That stability is crucial for tools like find or du.
Bash’s fallback getcwd() assumes that the inode from stat() matches one returned by readdir().
OverlayFS breaks that assumption.
We eventually realized that the OverlayFS documentation acknowledges this limitation:
For directories, the inode number from readdir() may not match the number from stat().
OverlayFS can deliver stable inode numbers via readdir() when the xino feature is active.
64-bit systems can encode extra data (e.g., instance numbers) into inode fields to prevent collisions.
This works without requiring a full lookup and does not hurt readdir() performance.
However, 32-bit systems lack this space and the xino feature is not available.
We encountered the original problem on a 32-bit ARM platform, which explained why the issue occurred there.
readdir() in Bash
One question remained: why did getcwd() sometimes fail with ENOTTY?
Upon inspecting Bash’s getcwd(), we noticed it misused readdir() slightly:
- readdir() returns NULL both on EOF and on error.
- To tell the two apart, the caller must set errno to zero before calling readdir().
- If readdir() returns NULL and errno == 0, it means EOF.
- Bash forgot to reset errno before the call. For about 30 years, no one noticed.
As a result, when readdir() returned NULL with no match, Bash incorrectly assumed an error.
It returned NULL and left errno in an undefined state.
Sometimes, ENOTTY from a previous system call remained, producing misleading errors.
We have reported the issue to the GNU Bash project. Once the bug report becomes publicly visible, it will be linked here.
This bug hunt revealed several contributing factors:
- Cross-compilation made Bash fall back to its own getcwd().
- The fallback getcwd() relied on assumptions that failed with OverlayFS.
- Missing errno handling around readdir() produced misleading errno values.
While we resolved the issue with a simple build tweak, the investigation highlighted deeper lessons about portability assumptions, legacy code, and filesystem complexity.
Wow great bug!
> Bash forgot to reset errno before the call. For about 30 years, no one noticed
I have to say, this part of the POSIX API is maddening!
99% of the time, you don't need to set errno = 0 before making a call. You check for a non-zero return, and only then look at errno.
But SOMETIMES you need to set errno = 0, because in this case readdir() returns NULL on both error and EOF.
I actually didn't realize this before working on https://oils.pub/
---
And it should go without saying: Oils simply uses libc - we don't need to support systems with a broken getcwd()!
Although a funny thing is that I just fixed a bug related to $PWD that AT&T ksh (the original shell, that bash is based on) hasn't fixed for 30+ years too!
(and I didn't realize it was still maintained)
https://www.illumos.org/issues/17442
https://github.com/oils-for-unix/oils/issues/2058
There is a subtle issue with respect to:
1) "trusting" the $PWD value you inherit from another process
2) Respecting symlinks - this is the reason the shell can't just call getcwd() !
if (*p != '/' || stat(p, &st1) || stat(".", &st2) ||
st1.st_dev != st2.st_dev || st1.st_ino != st2.st_ino)
p = 0;
Basically, the shell considers BOTH the inherited $PWD and the value of getcwd() to determine its $PWD. It can't just use one or the other!
The response to the bug on the mailing list is disheartening. Report goes "set errno=0 so your error message makes sense". Didn't get a thanks, fixed.
Instead there's objections on the basis "filesystems shouldn't work like that".
There seem to be a bunch of toxic people on the bash mailing list, and I think many or all of them don't even contribute code
The person who responded dismissively later says "I'm just another user."
---
Every commit since they started using git in 2009 is attributed to one person:
https://cgit.git.savannah.gnu.org/cgit/bash.git/log/
I think occasionally contributed patches are applied, but this is not apparent in source control.
I was attacked on the bash mailing list several years ago, so I don't go there anymore :-)
Most of the stuff that configure scripts check is obsolete, and breaks in situations like this as the checks are often not workable without running code. It is likely the check does not apply to any system that has existed for decades. Lots of systems have disabled it, e.g. Nix in 2017 [1]
[1] https://github.com/NixOS/nixpkgs/commit/dff0ba38a243603534c9...
I had a look at the bash source code a few years back, and there are tons of hacks and workarounds for 1980s-era systems. Looking at the git log, GETCWD_BROKEN was added in bash 1.14 from 1996, presumably to work around some system at the time (a system which was perhaps already old in 1996, but it's not detailed which).
Also, that getcwd.c which contains the getcwd() fallback and bug is in K&R C, which should be a hint at how well maintained all of this is. Bash takes "don't fix it if it ain't broke" to new levels, to the point of introducing breakage like here (the bash-malloc is also notorious for this – no idea why that's still enabled by default).