It's the end of observability as we know it (and I feel fine)

2025-06-11 · www.honeycomb.io

The history of observability tools over the past decade has been about a pretty simple concept, but LLMs bring the death of that paradigm.

In a really broad sense, the history of observability tools over the past couple of decades has been about a pretty simple concept: how do we make terabytes of heterogeneous telemetry data comprehensible to human beings? New Relic did this for the Rails revolution, Datadog did it for the rise of AWS, and Honeycomb led the way for OpenTelemetry.

The loop has been the same in each case. New abstractions and techniques for software development and deployment gain traction, those abstractions make software more accessible by hiding complexity, and that complexity requires new ways to monitor and measure what’s happening. We build tools like dashboards, adaptive alerting, and dynamic sampling. All of these help us compress the sheer amount of stuff happening into something that’s comprehensible to our human intelligence.

In AI, I see the death of this paradigm. It’s already real, it’s already here, and it’s going to fundamentally change the way we approach systems design and operation in the future.

I’m going to tell you a story. It’s about this picture:

[Image: LLMs are just universal function approximators, but it turns out that those are really useful]

If you’ve ever seen a Honeycomb demo, you’ve probably seen this image. We love it, because it’s not only a great way to show a real-world problem—it’s something that plays well to our core strengths of enabling investigatory loops. Those little peaks you see in the heatmap represent slow requests in a frontend service that rise over time before suddenly resetting. They represent a small percentage of your users experiencing poor performance—and we all know what this means in the real world: lost sales, poor experience, and general malaise at the continued enshittification of software.

In a Honeycomb demo, we show you how easy it is to use our UI to understand what those spikes actually mean. You draw a box around them, and we run BubbleUp to detect anomalies by analyzing the trace data that’s backing this visualization, showing you what’s similar and what’s different between the spikes and the baseline. Eventually, you can drill down to the specific service and even method call that’s causing the problem. It’s a great demo, and it really shows the power of our platform.

Last Friday, I showed a demo at our weekly internal Demo Day. It started with what I just showed you, and then I ran a single prompt through an AI agent that read as follows:

Please investigate the odd latency spikes in the frontend service that happen every four hours or so, and tell me why they’re happening.

The screenshot here elides the remainder of the response from the LLM (please find the entire text at the end of this post), but there are a few things I want to call out. First, this wasn’t anything too special. The agent was something I wrote myself in a couple of days; it’s just an LLM calling tools in a loop. The model itself is off-the-shelf Claude Sonnet 4. The integration with Honeycomb is our new Model Context Protocol (MCP) server. It took 80 seconds, made eight tool calls, and not only did it tell me why those spikes happened, it figured it out in a pretty similar manner to how we’d tell you to do it with BubbleUp.
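
For the curious, the shape of that loop is small enough to sketch. Here is roughly what “an LLM calling tools in a loop” looks like in Python with the Anthropic SDK. This is a simplified illustration, not the agent I demoed: the run_honeycomb_query tool and its schema are stand-ins rather than our MCP server’s real interface, and the model ID may differ for your account.

    # A minimal "LLM calling tools in a loop" agent, sketched with the Anthropic
    # Python SDK. run_honeycomb_query is a placeholder, not the real MCP tooling.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    TOOLS = [{
        "name": "run_honeycomb_query",
        "description": "Run an analytics query over telemetry and return JSON results.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "object"}},
            "required": ["query"],
        },
    }]

    def run_honeycomb_query(query: dict) -> str:
        """Placeholder: forward the query spec to your observability backend."""
        raise NotImplementedError

    def investigate(prompt: str) -> str:
        messages = [{"role": "user", "content": prompt}]
        while True:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",  # off-the-shelf Claude Sonnet 4
                max_tokens=4096,
                tools=TOOLS,
                messages=messages,
            )
            messages.append({"role": "assistant", "content": response.content})
            if response.stop_reason != "tool_use":
                # No more tool calls: return the model's final write-up.
                return "".join(b.text for b in response.content if b.type == "text")
            # Execute each requested tool call and hand the results back.
            results = []
            for block in response.content:
                if block.type == "tool_use":
                    results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": run_honeycomb_query(**block.input),
                    })
            messages.append({"role": "user", "content": results})

Point investigate() at the prompt above and the model decides which queries to run, reads the results, and keeps going until it has an answer. Everything interesting lives in the tools you expose and the telemetry behind them.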

This isn’t a contrived example. I basically asked the agent the same question we’d ask you in a demo, and the agent figured it out with no additional prompts, training, or guidance. It effectively zero-shot a real-world scenario.

And it did it for sixty cents.

I want to be clear, this was perhaps the least optimized version of this workflow. Inference costs are only going down, and we can certainly make our MCP server more efficient. There are ways to reduce the amount of input tokens even more. We can play around with more tailored aggregations and function calls that return LLM-optimized query results. It’s an exciting new era!

It should also serve as a wake-up call to the entire industry. This is a seismic shift in how we should conceptualize observability tooling. If your product’s value proposition is nice graphs and easy instrumentation, you are le cooked. An LLM commoditizes the analysis piece; OpenTelemetry commoditizes the instrumentation piece. The moats are emptying.

I’m not gonna sit here and say this destroys the idea of humans being involved in the process, though. I don’t think that’s true. The rise of the cloud didn’t destroy the idea of IT. The existence of Rails doesn’t mean we don’t need server programmers. Productivity increases expand the map. There’ll be more software, of all shapes and sizes. We’re going to need more of everything.

The question, then, is: what does this require from us? Where does observability sit in a world where code is cheap, refactors are cheap, and analysis is a constant factor?

I’m gonna put a marker out there: the only thing that really matters is fast, tight feedback loops at every stage of development and operations. AI thrives on speed—it’ll outrun you every time. To succeed, you need tools that move at the speed of AI as well. The slower your analysis engine, the worse your results. LLMs will generate, test, and discard hypotheses faster than you ever can. They might be wrong a dozen times before they get it right, but again, we’re talking about fractional pennies here.

I don’t see this as a reason to despair, but as something to celebrate. The things that AI-assisted development and operations can lead us to are the things we’ve always been about. Honeycomb’s entire modus operandi is predicated on fast feedback loops, collaborative knowledge sharing, and treating everything as an experiment. We’re built for the future that’s here today, on a platform that allows us to be the best tool for tomorrow.

The exact contours of what this industry looks like in two, five, or ten years are out of my view. I don’t think there’s anyone that can credibly tell you what things will look like. I’m also not really in the business of making predictions. What I can say, though, is that I think we’re going to see a spectrum of AI assistance in development and operational tasks across the entire software development lifecycle:

  • AI agents can run passively as you write and deploy code, suggesting fixes to improve code quality based on real-world observations from production.
  • Emergent system behavior can be detected and addressed by AI, with unattended agents identifying impacts to users, investigating them, compiling reports, and presenting those to human operators for approval and tweaks before making changes.
  • Bleeding-edge organizations can adopt entirely unattended SWE/SRE roles, powered by tools like Honeycomb and friends, optimizing and writing software to meet business needs and goals.

In all of these cases, what matters most is speed. Traditional monitoring and observability tools don’t have what it takes to compete in this world. Pretty dashboards, pre-built alerts, and magic integrations aren’t sufficient for the future we’re building today. Sub-second query performance, unified data storage, and collaborative workflows between humans and AI are the keys to success in the future of observability.

Let’s go do something cool.

Addendum: Here’s the full output from the investigation screenshot earlier.

Perfect! Now I have a clear picture of what’s happening. Based on my investigation, I can explain the latency spikes you’re seeing in your frontend service.

🔍 Investigation Results: Frontend Latency Spikes

Root Cause Identified: Checkout Service Performance Issues

The latency spikes you’re experiencing every ~4 hours are NOT actually periodic—this appears to be a load testing or synthetic traffic pattern rather than a scheduled system issue. Here’s what I found:

📊 Key Findings:

  1. Primary Culprit: Checkout Operations
    • /cart/checkout endpoint: P95 = 3.7 seconds (vs normal ~500ms)
    • CheckoutService/PlaceOrder: P95 = 4.2 seconds
    • Maximum latency observed: 7.6 seconds
  2. Spike Pattern Analysis:
    • Spikes are concentrated in checkout operations only
    • Other endpoints remain relatively stable
    • 485 slow checkout operations in the last 24 hours
    • Pattern suggests load testing or batch checkout processing
  3. Downstream Service Impact:
    • CheckoutService/PlaceOrder: P95 = 4.6 seconds
    • getDiscounts operation: P95 = 4.2 seconds
    • ShippingService/ShipOrder: P95 = 4.0 seconds

🕵️ Evidence from Trace Analysis:

Looking at the sample traces, I can see:

  • User Agent Pattern: python-requests/2.28.2 and python-requests/2.31.0
  • Consistent Client IPs: 52.45.250.5 and internal IPs
  • High Cart Values: $6,861, $7,036, $259 (suggesting test scenarios)
  • Same User IDs: Multiple requests from user 20109 with identical cart totals

💡 Why This is Happening:

Most Likely Cause: Load Testing or Synthetic Monitoring

  • The pattern suggests automated testing rather than real user traffic
  • Python requests library indicates scripted traffic
  • High cart values and repeated scenarios point to test data
  • The “every 4 hours” pattern is likely your load testing schedule

Technical Root Cause:

  • Checkout service bottleneck during high-volume operations
  • Discount calculation delays (4.2s P95)
  • Shipping service delays during order processing
  • Possible database contention or external API timeouts

🎯 Recommendations:

  1. Immediate Actions:
    • Verify if this is load testing—check with your testing team
    • Monitor checkout service resources during spike periods
    • Check discount service performance and external dependencies
  2. Performance Optimization:
    • Optimize discount calculation logic
    • Add caching for shipping calculations
    • Implement circuit breakers for external payment/shipping APIs
    • Scale checkout service if this represents real user load
  3. Monitoring Improvements:
    • Separate alerts for synthetic vs. real user traffic
    • Add SLI tracking specifically for checkout operations
    • Monitor downstream dependencies (payment, shipping, discount services)

🚨 Action Items:

  1. Confirm if the 4-hour pattern matches your load testing schedule
  2. If it’s real traffic, investigate checkout service capacity and dependencies
  3. Consider implementing checkout-specific performance monitoring and alerting

The good news is that your other services are performing well—this appears to be an isolated issue with the checkout flow, likely exacerbated by load testing scenarios.


Comments (via Hacker News)

  • By RainyDayTmrw 2025-06-11 4:39 (2 replies)

    I think we are, collectively, greatly underestimating the value of determinism and, conversely, the cost of nondeterminism.

    I've been trialing a different product with the same sales pitch. It tries to RCA my incidents by correlating graphs. It ends up looking like this page[1], which is a bit hard to explain in words, but both obvious and hilarious when you see it for yourself.

    [1]: https://tylervigen.com/spurious-correlations

    • By graemep 2025-06-11 5:52 (2 replies)

      It's fun, but the point should be well known (I know it's not). Time series are very prone to spurious correlations - r² is not useful.

      It's even worse if you just eyeball a graph. If something changes over time, you need to use appropriate measures.
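
      A toy illustration of the point (just a sketch, exact numbers will vary by seed): two independent random walks routinely show a large r², and differencing the series makes the spurious relationship disappear.

        # Two independent random walks: no relationship at all, yet the sample
        # correlation is routinely far from zero because both series trend.
        import numpy as np

        rng = np.random.default_rng(42)
        x = np.cumsum(rng.standard_normal(1000))  # random walk A
        y = np.cumsum(rng.standard_normal(1000))  # random walk B, independent of A

        r = np.corrcoef(x, y)[0, 1]
        print(f"levels:      r^2 = {r**2:.2f}")   # often "looks significant"

        dr = np.corrcoef(np.diff(x), np.diff(y))[0, 1]
        print(f"differences: r^2 = {dr**2:.3f}")  # near zero, as it should be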

      • By feoren 2025-06-11 16:42

        > r² is not useful

        People want so badly to have an "objective" measure of truth that they can just pump data into and get a numeric result that "doesn't lie". r², p < 0.05, χ2, etc. It's too far to say these numbers aren't useful at all -- they are -- but we're just never going to be able to avoid the difficult and subjective task of interpreting experimental results in their broader context and reconciling them with our pre-existing knowledge. I think this is why people are so opposed to anything Bayesian: we don't want to have to do that work, to have those arguments with each other about what we believed before the experiment and how strongly. But the more we try to be objective, the more vulnerable we are to false positives and spurious correlations.

      • By SOLAR_FIELDS 2025-06-13 14:42

        I agree - the problem here is probably less that the AI makes the mistake and more that it’s just really easy to make this mistake whether you’re AI or human. It’s probably true that the AI selects more correlations that would be obviously irrelevant to the end user though. I, like others doing SRE work, spend a fair amount of time trawling through these types of graphs and it’s very common to see something that looks like a correlation, look closer, then dismiss it due to noise.

        Time series are also very much subject to MTUP[1]. Something that looks like a correlation or problem at a certain zoom level becomes totally normal behavior when you zoom out, for instance.

        1: https://en.m.wikipedia.org/wiki/Modifiable_temporal_unit_pro...

    • By worldsayshi 2025-06-12 7:04

      Perhaps I'm missing your point a bit, but you can absolutely have deterministic UX when it matters with LLM-based applications if you design it right. Whenever you need determinism, make the LLM generate a deterministic specification for how to do something and/or record its actions. And let the user save away re-playable specifications along with the dialogue. Then build ways for the AI to suggest fixes for failing specs when needed.

      It's basically the same flow as when you use AI for programming. Except you need to constrain the domain of the specifications more and reason more about how to allow the AI to recover from failing specifications if you don't want to force the user to learn your specification language.
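
      A rough sketch of what I mean, with made-up names, where the "spec" is just recorded data that can be replayed without the LLM:

        # The LLM plans once; the plan is saved as plain data; later runs replay
        # the saved steps deterministically, without consulting the model again.
        import json
        from typing import Any, Callable

        def record_spec(steps: list[dict], path: str) -> None:
            # e.g. steps = [{"tool": "export_report", "args": {"format": "csv"}}]
            with open(path, "w") as f:
                json.dump(steps, f, indent=2)

        def replay_spec(path: str, tools: dict[str, Callable[..., Any]]) -> list[Any]:
            # Deterministic by construction: same file in, same tool calls out.
            with open(path) as f:
                steps = json.load(f)
            return [tools[s["tool"]](**s["args"]) for s in steps]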

  • By zug_zug 2025-06-11 2:37 (1 reply)

    As somebody who's good at RCA, I'm worried all my embarrassed coworkers are going to take at face value a tool that's confidently incorrect 10% of the time and screw stuff up more instead of having to admit they don't know something publicly.

    It'd be less bad if the tool came to a conclusion, then looked for data to disprove that interpretation, and then made a more reliable argument or admitted its uncertainty.

    • By jakogut 2025-06-11 2:56 (2 replies)

      You can achieve a good amount of this with system prompts. I've actually had good success using LLMs to craft effective system prompts and custom instructions to get more rigorous and well researched answers by default.

      One I use with ChatGPT currently is:

      > Prioritize substance, clarity, and depth. Challenge all my proposals, designs, and conclusions as hypotheses to be tested. Sharpen follow-up questions for precision, surfacing hidden assumptions, trade offs, and failure modes early. Default to terse, logically structured, information-dense responses unless detailed exploration is required. Skip unnecessary praise unless grounded in evidence. Explicitly acknowledge uncertainty when applicable. Always propose at least one alternative framing. Accept critical debate as normal and preferred. Treat all factual claims as provisional unless cited or clearly justified. Cite when appropriate. Acknowledge when claims rely on inference or incomplete information. Favor accuracy over sounding certain.

      • By j_bum 2025-06-11 4:09 (1 reply)

        Do you just add this to your “instructions” section?

        And what type of questions do you ask the model?

        Thanks for sharing

        • By jakogut 2025-06-11 14:30

          Yes, with ChatGPT, I added that paragraph as custom instructions under personalization.

          I ask a wide variety of things, from what a given plant is deficient in based on a photo, to wireless throughput optimization (went from 600 Mbps to 944 Mbps in one hour of tuning). I use models for discovering new commands, tools, and workflows, interpreting command output, and learning new keywords for deeper and more rigorous research using more conventional methods. I rubber duck with it, explaining technical problems, my hypotheses, and iterating over experiments until arriving at a solution, creating a log in the process. The model is often wrong, but it's also often right, and used the way I do, it's quickly apparent when it's wrong.

          I've used ChatGPT's memory feature to extract questions from previous chats that have already been answered to test the quality and usability of local models like Gemma3, as well as craft new prompts in adjacent topics. Prompts that are high leverage, compact, and designed to trip up models that are underpowered or over quantized. For example:

          >> "Why would toggling a GPIO in a tight loop not produce a square wave on the pin?"

          > Tests: hardware debounce, GPIO write latency, MMIO vs cache, bus timing.

          >> "Why is initrd integrity important for disk encryption with TPM sealing?"

          > Tests: early boot, attack surface, initramfs tampering vectors.

          >> "Why would a Vulkan compute shader run slower on an iGPU than a CPU?"

          > Tests: memory bandwidth vs cache locality, driver maturity, PCIe vs UMA.

          >> "Why is async compute ineffective on some GPUs?"

          > Tests: queue scheduling, preemption granularity, workload balance.

          >> "Why might a PID loop overshoot more when sensor update rate decreases?"

          > Tests: delayed ACK, bufferbloat, congestion control tuning.

          >> "How can TCP experience high latency even with low packet loss?"

          > Tests: delayed ACK, bufferbloat, congestion control tuning.

          >> "How can increasing an internal combustion engine's compression ratio improve both torque and efficiency?"

          > Tests: thermodynamics, combustion behavior, fuel octane interaction.

          >> "How can increasing memory in a server slow it down?"

          > Tests: NUMA balancing, page table size, cache dilution.

          >> "Why would turning off SMT increase single-threaded performance on some CPUs?"

          > Tests: resource contention, L1/L2 pressure, scheduling policy.

      • By dr_kiszonka 2025-06-11 6:26

        This looks excellent. Thanks for sharing.

  • By heinrichhartman 2025-06-11 7:44 (1 reply)

    > New Relic did this for the Rails revolution, Datadog did it for the rise of AWS, and Honeycomb led the way for OpenTelemetry.

    I find this reading of the history of OTel highly biased. OpenTelemetry was born as the merge of OpenCensus (initiated by Google) and OpenTracing (initiated by LightStep):

    https://opensource.googleblog.com/2019/05/opentelemetry-merg...

    > The seed governance committee is composed of representatives from Google, Lightstep, Microsoft, and Uber, and more organizations are getting involved every day.

    Honeycomb has for sure had valuable code & community contributions and championed the technology's adoption, but they are very far from "leading the way".

    • By loevborg 2025-06-11 11:59 (1 reply)

      As someone who recently adopted Honeycomb, it really is an amazing tool. Especially with OTel auto-instrumentation, you get insights within a few hours. The dashboard / query features are very powerful and obviously stem from a deep philosophical understanding of observability. My team was shocked at how good the tool is.

      Datadog by contrast seems to be driven by marketing and companies having an "observability" checkbox to tick.

      • By ayewo 2025-06-11 15:48 (1 reply)

        Which programming languages are you using with the OTel auto-instrumentation feature?

        • By loevborg 2025-06-11 19:47

          Node and Python. It's amazing how much works out of the box - Express routes, HTTP queries, DNS queries, the list goes on.
