Ruby's Timeout is dangerous and Thread.raise is terrifying (2015)

2024-06-03 9:42 · 160 points, 126 comments · jvns.ca

This is already documented in Timeout: Ruby’s most dangerous API. And normally I don’t like making blanket statements about language features. But I had a bad day at work because of this issue. So today, we’re talking about Timeout! :)

First! What is Timeout? Let’s say you have a bunch of code that might be slow: a network request, a long computation, whatever. Ruby’s timeout documentation helpfully says:

Timeout provides a way to auto-terminate a potentially long-running operation if it hasn’t finished in a fixed amount of time.

require 'timeout'
status = Timeout::timeout(5) {
  # Something that should be interrupted if it takes more than 5 seconds...
}

AWESOME. This is so much easier than wrangling network socket options which might be set deep inside some client library. Seems great!

I tried using Timeout at work last week, and it resulted in an extremely difficult-to-track-down bug. I felt awesome about tracking it down, and upset with myself for creating it in the first place. Let’s talk a little more about this (mis)feature.

Timeout: how it works (and why Thread.raise is terrifying)

Its implementation seems kind of clever at first. You can read the code here. It starts a new thread which keeps a reference to the original thread (the x below), sleeps for 5 seconds, and then raises an exception on the original thread, interrupting whatever it was doing.

begin
  sleep sec                    # wait out the timeout period
rescue => e
  x.raise e                    # the sleep itself was interrupted: forward that to the original thread
else
  x.raise exception, message   # time is up: raise (usually Timeout::Error) on the original thread
end
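
Pieced together, the whole mechanism amounts to something like the following stripped-down sketch (not the real implementation, which also handles custom exception classes and unwinding; slow_operation is a hypothetical stand-in for your block):

require 'timeout'   # only for the Timeout::Error constant

def naive_timeout(sec)
  current = Thread.current   # the thread to interrupt ("x" above)
  watcher = Thread.new do
    sleep sec                # wait out the allowance...
    current.raise Timeout::Error, "execution expired"   # ...then interrupt the caller
  end
  yield                      # run the caller's block on the original thread
ensure
  watcher.kill               # block finished (or raised) in time; stand down
end

naive_timeout(5) { slow_operation }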

Let’s look at the documentation on Thread.raise. It says:

Raises an exception (see Kernel::raise) from thr. The caller does not have to be thr.

This is where the implications get interesting, and terrifying. This means that an exception can get raised:

  • during a network request (ok, as long as the surrounding code is prepared to catch Timeout::Error)
  • during the cleanup for the network request
  • during a rescue block
  • while creating an object to save to the database afterwards
  • in any of your code, regardless of whether it could have possibly raised an exception before

Nobody writes code to defend against an exception being raised on literally any line. That’s not even possible. So Thread.raise is basically like a sneak attack on your code that could result in almost anything. It would probably be okay if it were pure-functional code that did not modify any state. But this is Ruby, so that’s unlikely :)
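
To make the failure mode concrete, here is a minimal sketch, assuming hypothetical acquire_connection/release_connection helpers: if the timer fires while the block is inside its ensure cleanup, the cleanup itself is aborted and the connection leaks.

require 'timeout'

def with_connection
  conn = acquire_connection   # hypothetical helper
  yield conn
ensure
  # If Timeout::Error arrives *here* via Thread#raise, this line is
  # abandoned mid-flight and the connection is never released.
  release_connection(conn)
end

Timeout.timeout(5) do
  with_connection { |conn| conn.slow_query }
end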

Timeout uses Thread.raise, so it is not safe to use.

Other languages and Thread.raise

So, how do other languages approach this? Go doesn’t have exceptions and JavaScript doesn’t have threads, so let’s talk about Python, Java, C#, and C++.

Java has java.lang.Thread.stop, which does essentially the same thing. It was deprecated in Java 1.2, back in 1998; the variant that throws an arbitrary exception was disabled entirely in Java 8. Its documentation reads:

Deprecated. This method is inherently unsafe. See stop() for details. An additional danger of this method is that it may be used to generate exceptions that the target thread is unprepared to handle (including checked exceptions that the thread could not possibly throw, were it not for this method). For more information, see Why are Thread.stop, Thread.suspend and Thread.resume Deprecated?.

Python has thread.interrupt_main(), which raises KeyboardInterrupt in the main thread, the same thing that happens when you Ctrl+C a program from your terminal. I’m not really sure what to say about this – certainly using thread.interrupt_main() also isn’t really a good idea, and it’s more limited in what it can do. I can’t find any reference to anybody considering using it for anything serious.

C# has Thread.Abort() which throws a ThreadAbortException in the thread. Googling it finds me a series of StackOverflow discussions & forum posts about how it’s dangerous and should not be used, for the reasons we’ve learned about.

C++: std::threads are not interruptible.

Not just an implementation issue

This is not just an implementation issue in Ruby, and you can read a great comment on Reddit illustrating this. The whole premise of a general timeout method that will interrupt an arbitrary block of code like this is flawed. Here’s the API again:

require 'timeout'
status = Timeout::timeout(5) {
  # Something that should be interrupted if it takes more than 5 seconds...
}

There is no way to safely interrupt an arbitrary block of code. Anything could be happening at the end of that 5 seconds.

However! All is not lost if we would like to interrupt our threads. Let’s turn to Java again! (you know all the times we say Ruby is more fun than Java? TODAY JAVA IS MORE FUN BECAUSE IT MAKES MORE SENSE.) Java has a Thread.interrupt method, but it only causes an InterruptedException to be thrown at specific blocking points, for instance during Thread.sleep; the rest of the time it just sets a flag, and the thread needs to explicitly call Thread.interrupted() to see if it’s supposed to stop.
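
Translated into Ruby, the cooperative version might look like this sketch (do_one_unit_of_work and release_resources are hypothetical): the worker polls a flag at points where stopping is safe, so cleanup always runs to completion.

stop_requested = false

worker = Thread.new do
  until stop_requested
    do_one_unit_of_work   # a short step; safe to stop between steps
  end
  release_resources       # cleanup runs uninterrupted, unlike with Thread#raise
end

sleep 5
stop_requested = true     # ask the worker to stop...
worker.join               # ...and wait until it actually has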

On documentation

I don’t know. It’s possible that everybody knew that Timeout was a disaster except for me, and that I should have thought more carefully about what the implications of this Thread.raise were. But I’m thinking of making a pull request on the Ruby documentation with slightly stronger language than

Timeout provides a way to auto-terminate a potentially long-running operation if it hasn’t finished in a fixed amount of time.

and

Raises an exception (see Kernel::raise) from thr. The caller does not have to be thr.

The Java approach (where they deprecated it with a strong warning and then disabled the method entirely) seems more like the right thing.


Comments

  • By simonw 2024-06-03 10:53 (7 replies)

    I'm fascinated by this general problem - I see it as a close relative of the "sandbox" problem, where I want to be able to safely run user-defined untrusted operations.

    For timeouts, I want to be able to build web application features which are guaranteed to terminate early after eg 2s in a way that resets all resources related to the operation.

    The best way I've found of doing this so far involves processes: run the time-bound operation in an entirely separate process and terminate it early if necessary.

    I do that with "rg" here: https://github.com/simonw/datasette-ripgrep/blob/a40161f6273...
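
    In the article's Ruby, a sketch of the same process-based pattern might look like this (the command and the 2-second budget are made up):

        pid = Process.spawn("rg", "pattern", "some/dir")   # hypothetical command
        deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 2

        loop do
          break if Process.waitpid(pid, Process::WNOHANG)  # child finished on its own
          if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
            Process.kill("KILL", pid)   # over budget: kill the whole process
            Process.waitpid(pid)        # reap it so we don't leave a zombie
            break
          end
          sleep 0.05
        end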

    • By hyperpape 2024-06-03 12:09 (1 reply)

      Even processes are not safe if they modify state outside of themselves (files on the filesystem or whatever else).

      It seems to me that the root issue is that encapsulation becomes our enemy here. We want to be able to call some code and not care about the details of how it is implemented. But reliable cancellation requires some sort of design that lets us know exactly what side-effects code has, control them, and force them to be done in a way that's transactional, while still letting us cancel the operation at the right time. We don't want a black box.

      I suspect the two general solutions are: 1) elbow grease: manually inspecting the code that you're calling to understand whether it's safe to cancel and designing a bespoke sandbox: a thread in some cases, a process in others, re-architecting the code you're calling in others... 2) something about algebraic effects or capability systems that I can't speak to because I only have vague ideas how they work and haven't applied them in anger.

      • By ekimekim 2024-06-03 14:12 (2 replies)

        > Even processes are not safe if they modify state outside of themselves (files on the filesystem or whatever else).

        All processes should be prepared for a sudden crash without corrupting state. Things get OOM-killed, machines get unplugged, networks go offline. We have things like journals and write-ahead logs and at-least-once messaging precisely because of these kinds of problems. If your process can't handle being terminated at any time then you have problems regardless of whether it's wrapped in timeout code.

        Doing this for a block of code within a process is a lot harder, because a) there's generally more surface area of externally-observable state, and b) code does not normally need to be prepared to handle it.

        • By joosters 2024-06-03 15:27 (4 replies)

          Your comment boils down to 'all code should be perfect'. Which is a lovely request, but doesn't really help.

          In particular, I'd challenge you to find one large program that handles OOM situations 100% correctly, especially since most code runs atop an OS configured to over-allocate memory. But even if it wasn't, I doubt there's any sizeable codebase that handles every memory allocation failure correctly and gracefully.

          • By brookst 2024-06-03 15:38

            I don't think GP's statement was that all code everywhere must be perfect.

            Just that code which is designed to be run in a separate process with the express intent of allowing termination at timeout should also be designed to not change state outside of itself. If somehow it really needs to (e.g. a 1TB working file), either that process or its invoker needs to have cleanup code that assumes processes have been terminated.

            Doesn't mean that ALL code needs to be perfect, or even that this code will be, just that a thoughtful design will at least create the requirements and tests that make this model work.

          • By chuckadams 2024-06-03 22:28

            Not perfect, but “crash-only” or at least as robust as such. Probably involving transactions of some sort. It is indeed a tall order, but if you’re sending kill signals to threads from the outside, that’s the reality you’re in. Find another abort mechanism if that’s too big an ask (and in most cases it justifiably is, that's why Java doesn't even allow it anymore)

          • By treflop 2024-06-04 10:37

            You can never be perfect enough where you won’t drop crumbs on your keyboard but if you make a rule to never eat near your computer, you will never drop crumbs on your keyboard.

            Writing high quality code is not about being perfect — it’s about changing how you write code.

            Writing code that can recover from unexpected errors at any point is simply a matter of being mindful about ordering your instructions and persisting data precisely.

          • By throw10920 2024-06-04 3:14 (1 reply)

            > Your comment boils down to 'all code should be perfect'.

            This is clearly false from a cursory skim of the parent comment, which says

            > All processes should be prepared for a sudden crash without corrupting state.

            ...which is leagues away from "all code should be perfect" - for instance, an obvious example is programs that can crash at any time without state corruption, but contain logic bugs that result in invalid data being (correctly) written to disk, independent of crashing or being killed.

            • By farley13 2024-06-06 4:47

              The convo has been a bit light on examples - I think a canonical example of how to achieve this can be found in ACID databases and high uptime mainframes. Definitely doable, but if performance and large scale data is also a concern, doing it from scratch (eg. starting with file handles) is probably a 5-10+ year process to write your own stable db.

        • By saagarjha 2024-06-05 10:57

          This is generally pretty slow and complicated.

    • By everforward 2024-06-03 14:25 (4 replies)

      I've become quite partial to Go's implementation. It uses a context.Context that may or may not have one of a few ways of communicating that whatever has this context should stop processing (I.e. a timeout, a deadline, or I believe one or two others).

      That context then has a .Done() method that returns a channel; when that channel is closed (on timeout or cancellation), whatever functions are using that context are expected to stop themselves at the soonest point that makes sense to them.

      Typically this is done inside a for loop in long-running processes. I.e. for something that copies, it looks like

          for {
              select {
                  case <-ctx.Done():
                       // we should stop and return a timeout error or something
                  default:
                       // copy some number of bytes, or check if a network call is done
              }
          }
      
      It does require all of the involved functions to implement support for this, though I think most things do at this point. I wouldn't call a library high quality unless it supports context.Context for long-running operations.

      It gives library authors the ability to determine at what points their code can be interrupted, run cleanup code as part of the timeout, etc.

      > The best way I've found of doing this so far involves proceses: run the time-bound operation in an entirely separate process and terminate it early if necessary.

      This doesn't handle remote resources cleanly, does it? E.g. if I were to lock a Postgres table for a query, and that query times out, will that correctly unlock the table and close the client? Or e.g. lock files? I'm sure some of that can be handled very carefully by managing it in the main process, but that seems error prone.

      • By TheDong 2024-06-03 15:36 (2 replies)

        > It does require all of the involved functions to implement support for this, though I think most things do at this point.

        Except for reading and writing data from a file using 'os.File', or reading and writing data from a network socket using a 'net.Conn'.

        Support for contexts is quite lacking in that the 'io.Writer' and 'io.Reader' interface don't have it, and those are the most important places to have it.

        Context also has the problem of waiting for cancellation to complete.

        Once you call "cancel()", it asynchronously tells a lot of goroutines to tear down, but it's painfully hard to know when they've noticed the cancellation and halted work, which in practice often leads to very subtle data races.

        > [Terminating processes] doesn't handle remote resources cleanly, does it? E.g. if I were to lock a Postgres table for a query, and that query times out, will that correctly unlock the table and close the client? Or e.g. lock files?

        Both postgres and file locks will correctly handle cleanup if the process dies (postgres notices the connection is dead and ends the transaction, the kernel releases filesystem locks a process is holding when it terminates).

        This is necessary because a process may exit basically at any time for any number of reasons, such as the kernel OOM-killing it.
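
        A quick Ruby sketch of the file-lock half of that claim (Unix-only; the demo.lock path is made up): kill the child holding the lock and the parent can take it immediately.

            path = "demo.lock"
            child = fork do
              File.open(path, "w") { |f| f.flock(File::LOCK_EX); sleep }
            end
            sleep 0.2                    # give the child time to take the lock
            Process.kill("KILL", child)  # simulate a crash
            Process.wait(child)
            File.open(path, "w") do |f|
              # Succeeds immediately: the kernel dropped the dead child's lock.
              puts f.flock(File::LOCK_EX | File::LOCK_NB) == 0
            end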

        • By everforward 2024-06-03 16:10 (2 replies)

          > Except for reading and writing data from a file using 'os.File', or reading and writing data from a network socket using a 'net.Conn'.

          > Support for contexts is quite lacking in that the 'io.Writer' and 'io.Reader' interface don't have it, and those are the most important places to have it.

          In a context world, you would use io.Writer/Reader or net.Conn to write small bits of data and check whether the context is cancelled in between 1KB writes (or whatever size).

          There is an edge case where it hangs (e.g. on writing to a crappy NFS share) but to the best of my knowledge, that stems from the kernel not being able to interrupt already-queued IO and some knock-on effects related to PIDs owning FDs. E.g. `ls` can't be interrupted when trying to list an NFS dir that's unstable.

          Would love to be told I'm wrong there if I am.

          > Once you call "cancel()", it async tells a lot of goroutines to teardown, but it's painfully hard to know when they've noticed the cancellation and halted work, which in practice often leads to very subtle data-races.

          I typically just defer a function in the goroutine that either writes to an "IsDead" channel or sets a mutex-protected boolean (depending on whether I need a single notification that it's dead, or a persistent way to check whether it's dead). It's not as simple as I'd like, but it's also not terribly hard.

          > Both postgres and file locks will correctly handle cleanup if the process dies (postgres notices the connection is dead and ends the transaction, the kernel releases filesystem locks a process is holding when it terminates).

          I was under the impression that it takes time for Postgres to notice the connection is dead; am I incorrect there? I thought that if a process terminates unexpectedly, Postgres would wait for its own timeout before terminating the client and freeing any resources used by it. I know it won't leak memory for forever, but having a table locked for 30 extra seconds could be a big problem in some situations (i.e. a monolithic DB that practically the whole company uses).

          • By TheDong 2024-06-04 0:44

            > In a context world, you would use io.Writer/Reader or net.Conn to write small bits of data and check whether the context is cancelled in between 1KB writes (or whatever size).

            So don't use 'io.ReadAll' or 'io.Copy' since they don't take a context thus don't internally do what you're suggesting. I guess the stdlib authors don't know how to use context either.

            Anyway, `reader.Read()`, even with just 1KB, can still take arbitrarily long. There's plenty of cases where you wait minutes or hours for data on a socket, and waiting that long to respect a context cancellation is of course unacceptable.

            > Postgres .. connection timeout

            Killing a process closes all its file descriptors, including sockets, and closing the tcp socket should cause the kernel to send a FIN to the server. Postgres should react to the client end of the socket closing pretty quickly.

            This does rely on you using the linux kernel tcp stack, not a userspace tcp stack (in which case all bets are off), but in practice that's pretty much always the case.

          • By fl0ki 2024-06-05 0:05

            > In a context world, you would use io.Writer/Reader or net.Conn to write small bits of data and check whether the context is cancelled in between 1KB writes (or whatever size).

            That can still block pretty much indefinitely. Imagine you're a client reading from a server, but the server isn't in any hurry to send anything, and keepalives are keeping the TCP connection open, and no network blips occur for months, so your goroutine is blocked on that read for months.

            The much simpler and more robust thing is to propagate context cancellation to socket close. The close will abort any blocked reads or writes.

            e.g.

                go func() {
                  <-ctx.Done()
                  _ = conn.Close()
                }()
            
            You'll still observe and return an error in the read/write call, and close is idempotent, so this doesn't take anything away from your existing logic and really just acts as a way to propagate cancellation.

            I don't know how well this works for other types of closeable reader/writer implementations. It may not even be thread-safe for some of them. But this worked great when I tried it for sockets.

            > I typically just defer a function in the goroutine that either writes to an "IsDead" channel or sets a mutex-protected boolean

            I try to just use `errgroup` whenever possible, even if no error can be returned. It's just the most fool-proof way I've found to make sure you return only when all nested goroutines are returned, and if you're consistent about it then this applies recursively too. It's a way to fake "structured concurrency" with quite readable code and very few downsides.

        • By oefrha 2024-06-03 17:02

          Sockets and pipes generally have SetReadDeadline() and SetWriteDeadline(). With io.Reader and io.Writer in general you have to resort to a separate goroutine and a channel, otherwise they would have to conform to more restricted interfaces, say ReadDeadliner/WriteDeadliner, which is not always possible.

      • By fl0ki 2024-06-03 16:14

        At least two correctness risks remain with Go's approach:

        goroutines observe this cancellation asynchronously. You can cancel an operation from your point of view, and begin another one (a retry of the first, or another operation altogether), but the original one is still running, creating side effects that get you into unintended states. If one can be running, potentially any number can be. You have to make sure to actually join on all past operations completing before beginning any new ones, and not all libraries give you a way to synchronously join on asynchronous operations. If you write your own, it's very possible, it just takes a lot of care.

        When you select { } over multiple non-default arms like this, and more than one of them is "ready", which one gets selected is random. This avoids starvation and is the right way to implement select { }, but most code that checks for cancellation incorrectly pretends this is not the case and that it will observe cancellation at the earliest possible time. It actually has an exponential probability series of observing cancellation later and later, compounding with the above issue. If the work done between selects is long (e.g. CPU or IO work) this compounds even further. The correct solution is to select for cancellation again on just one non-default arm, but that is not "idiomatic" so nobody does it.

        All of this is manageable with due care. Some libraries make it impossible because they kindly encapsulate not just what you don't need to know but what you actually do need to know if you want correct deterministic behavior. In my experience, very few developers, even of popular libraries, actually understand the semantics Go promises and how to build correct mechanisms out of them.

      • By acaloiar 2024-06-03 15:22 (1 reply)

        The context done channels are clearly the way when dealing with all native Go code.

        Although, to the grandparent's point, when you're dealing with executables or libraries outside of your control, the only true way I know of to get a "handle" on them is to create a process, with its pid becoming your handle.

        In situations like image processing, conversion, video transcoding, document conversion, etc. you're often dealing with non-Go-native libraries (although this problem transcends language), and there's no in-language way to time-bound them. That is to say, you can't know whether they'll halt, so you often need to put time bounds and preemption around execution. What I've had good success with is adding a process manager around those external processes and killing the pid when a timeout or deadline is missed. You can also give users controls to kill processes.

        Obviously there are considerations with resource cleanup and all sorts of negative consequences to this, depending on your use case, but it does provide options for time bounding and preempting things that are otherwise non-preemptable.

        • By everforward 2024-06-03 15:40 (1 reply)

          Ahh, I hadn't considered operating across languages. That does make it awkward if you can't inject some Go (or other) controls in the middle by having Go manage the loop and only calling incremental processing in the other library.

          That is awkward. My first thought is "just don't use the library" but that's obviously a non-starter for a lot of things, and my second thought was "transpile it" which sounds worse.

          I suppose the signals do allow the binary/library to do its own cleanup if it's well-behaved, so it's really a binary/library quality issue at the end of the day, just as it is for Go/Python/whatever-native code. There isn't a massive semantic difference between ctx.Done() and a SIGHUP handler; a SIGHUP handler can also defer killing the process until a sane point after cleanup.

      • By wrs 2024-06-03 19:00

        All processes can crash at any time due to out-of-memory, bugs, hardware failures, etc. so this should not introduce additional inter-process failure modes. It may reveal existing failure modes, of course!

    • By rcxdude 2024-06-03 11:53

      The general issue is isolation, for sure: you need to be able to clean up the operation regardless of what state it is in when you abort it, and that means that anything you share with it needs to be designed for this. Threads in most OSs simply don't provide enough isolation to do this; two threads share enough resources that you can't clean up effectively (in part because many other parts of the system assume that threads are not an isolation barrier), whereas processes, and the means of sharing things between processes, are specifically designed so that this is possible.

    • By dools 2024-06-03 11:29 (1 reply)

      I recently had to impose a timeout on a dbus operation using python and the only way I could get it working reliably was with a subprocess. Everything I tried using threads was crazily unreliable.

      • By bheadmaster 2024-06-03 11:41 (1 reply)

        > the only way I could get it working reliably was with a subprocess

        Totally agree.

        I've worked in the past on a multi-camera computer vision system in which I used video4linux to capture images from USB cameras which would sometimes misbehave and block on an ioctl() forever. Not even SIGALRM helped.

        The only mechanism that could properly interrupt this situation was a process, killed via SIGKILL, which caused the OS to clean up all the resources afterwards. Eventually, we connected the cameras to a USB-controlled relay that mechanically replugged the cameras after a certain number of consecutive timeouts (ioctl freezes).

        • By coryrc 2024-06-03 14:29

          > multi-camera computer vision system

          That sounds interesting too.

    • By immibis 2024-06-03 12:47 (1 reply)

      You can sandbox time limits if you firewall state. If nothing in the code being timed out can affect its caller other than raising a "thing has timed out" error, then timing out is safe. Process boundaries provide this firewall on most systems, but they don't have to be the only boundary. I once read about an academic language that had this feature - I don't remember what it was called. It had checked and unchecked exceptions like Java. Checked exceptions are thrown, declared and caught like regular exception control flow. Unchecked exceptions can be raised anywhere, and can only be caught in a way that creates a state firewall. I don't remember how that was enforced.

      • By nijave 2024-06-03 22:29

        In such a system each component still needs to be interruptible. If you're waiting on a blocking operation like disk or network I/O, the caller might give up but the request will still hang around until completion at which point it gets thrown out.

        In the case of a blocked process, you may be able to force kill it at a system level but then you risk uncleaned up state (leaked connections/resources)

        For instance, this is the default behavior with Postgres (queries complete even if the connection is closed)

    • By wongarsu 2024-06-03 19:34 (1 reply)

      In cooperative settings, say when writing your own web server, .NET Core does this quite well with CancellationToken. Basically it's just a convenient synchronized bool that everyone passes to their callees and occasionally checks to see if they should abort what they are doing. Most async APIs have overloads that take a CancellationToken, so even those operations are cancelled as soon as you cancel the token. They are really useful for imposing time limits and for making sure you stop processing a request after it was aborted by the user. And because there isn't much magic, it isn't too hard to make sure you reset your resources.

      But that only works if you trust the code you are executing. If you don't, you pretty much have to either use the primitives provided by your OS or run your own interpreter (Lua and WASM are popular for a reason)

      • By nijave 2024-06-03 22:23

        In Go there are contexts that do something similar.

        .NET and Go seem to have fairly good request cancellation facilities.

    • By Sohcahtoa82 2024-06-04 17:00

      I've had a similar problem before.

      I wanted to add a "!calc" trigger to an IRC bot written in Python. Rather than implement the Shunting Yard Algorithm [0] to make sure PEMDAS was handled correctly, I wanted to use `eval` (with some extra steps using the AST library to whitelist AST node types and prevent code injection), but with Python supporting essentially infinite integer precision, someone writing "!calc 9 * 9 * 9 * 9 * 9 * 9 * [..]" would eat up my entire CPU.

      I solved it the same way. I execute it in a separate process and terminate it if it takes more than a couple seconds.

      [0] https://en.wikipedia.org/wiki/Shunting_yard_algorithm

  • By mijoharas 2024-06-03 10:39

    Interesting, there's a ruby feature request[0] for a safer thread api (this has been a known issue for a long time), and I just saw that it got assigned a couple of months ago. Maybe this'll get addressed in the next ruby version.

    [0] https://bugs.ruby-lang.org/issues/17849

  • By llimllib 2024-06-03 12:19

    I hit a pretty brutal (and fun, if you are into this sort of thing) problem with Timeout 11 years ago: https://stackoverflow.com/questions/17237743/timeout-within-...
