
Why traditional logging fails and how wide events can fix your observability
Your logs are lying to you. Not maliciously. They're just not equipped to tell the truth.
You've probably spent hours grep-ing through logs trying to understand why a user couldn't check out, why that webhook failed, or why your p99 latency spiked at 3am. You found nothing useful. Just timestamps and vague messages that mock you with their uselessness.
This isn't your fault. Logging, as it's commonly practiced, is fundamentally broken. And no, slapping OpenTelemetry on your codebase won't magically fix it.
Let me show you what's wrong, and more importantly, how to fix it.
Logs were designed for a different era. An era of monoliths, single servers, and problems you could reproduce locally. Today, a single user request might touch 15 services, 3 databases, 2 caches, and a message queue. Your logs are still acting like it's 2005.
Here's what a typical logging setup looks like:
[Interactive demo: The Log Chaos Simulator]
That's 13 log lines for a single successful request. Now multiply that by 10,000 concurrent users. You've got 130,000 log lines per second. Most of them saying absolutely nothing useful.
But here's the real problem: when something goes wrong, these logs won't help you. They're missing the one thing you need: context.
When a user reports "I can't complete my purchase," your first instinct is to search your logs. You type their email, or maybe their user ID, and hit enter.
[Interactive demo: The Futile Search]
String search treats logs as bags of characters. It has no understanding of structure, no concept of relationships, no way to correlate events across services.
The same user ID can appear in any of these shapes, and your search has to guess all of them:

- `user-123`
- `user_id=user-123`
- `{"userId": "user-123"}`
- `[USER:user-123]`
- `processing user: user-123`

And those are just the logs that include the user ID. What about the downstream service that only logged the order ID? Now you need a second search. And a third. You're playing detective with one hand tied behind your back.
The fundamental problem: logs are optimized for writing, not for querying.
Developers write console.log("Payment failed") because it's easy in the moment. Nobody thinks about the poor soul who'll be searching for this at 2am during an outage.
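The contrast is easy to see side by side. A minimal sketch (the field names here are my own, not from any particular library):

```typescript
// What gets written in the moment: easy to emit, useless to query.
console.log("Payment failed");

// The same moment captured as a structured event. Every field is
// something the 2am responder can filter or group by.
const event = {
  event: "payment_failed",
  user_id: "user_123",
  order_id: "order_456",
  provider: "stripe",
  decline_code: "insufficient_funds",
  amount_cents: 15999,
};
console.log(JSON.stringify(event));
```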
Before I show you the fix, let me define some terms. These get thrown around a lot, often incorrectly.
Structured Logging: Logs emitted as key-value pairs (usually JSON) instead of plain strings. {"event": "payment_failed", "user_id": "123"} instead of "Payment failed for user 123". Structured logging is necessary but not sufficient.
Cardinality: The number of unique values a field can have. user_id has high cardinality (millions of unique values). http_method has low cardinality (GET, POST, PUT, DELETE, etc.). High cardinality fields are what make logs actually useful for debugging.
Dimensionality: The number of fields in your log event. A log with 5 fields has low dimensionality. A log with 50 fields has high dimensionality. More dimensions = more questions you can answer.
Wide Event: A single, context-rich log event emitted per request per service. Instead of 13 log lines for one request, you emit 1 line with 50+ fields containing everything you might need to debug.
Canonical Log Line: Another term for wide event, popularized by Stripe. One log line per request that serves as the authoritative record of what happened.
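To make the two definitions concrete: over a batch of events, cardinality is the count of distinct values a field takes, and dimensionality is the count of fields on a single event. A quick sketch (the sample events are invented for illustration):

```typescript
type WideEvent = Record<string, unknown>;

// Cardinality: how many distinct values a field takes across a batch.
function cardinality(events: WideEvent[], field: string): number {
  return new Set(events.map((e) => e[field])).size;
}

// Dimensionality: how many fields a single event carries.
function dimensionality(event: WideEvent): number {
  return Object.keys(event).length;
}

const events: WideEvent[] = [
  { user_id: "u1", http_method: "GET", status: 200 },
  { user_id: "u2", http_method: "GET", status: 200 },
  { user_id: "u3", http_method: "POST", status: 500 },
];

console.log(cardinality(events, "user_id"));     // grows with your user base
console.log(cardinality(events, "http_method")); // stays small
```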
[Interactive demo: Cardinality Explorer]
I see this take constantly: "Just use OpenTelemetry and your observability problems are solved."
No. OpenTelemetry is a protocol and a set of SDKs. It standardizes how telemetry data (logs, traces, metrics) is collected and exported. This is genuinely useful: it means you're not locked into a specific vendor's format.
But here's what OpenTelemetry does NOT do:
1. It doesn't decide what to log. You still have to instrument your code deliberately.
2. It doesn't add business context. If you don't add the user's subscription tier, their cart value, or the feature flags enabled, OTel won't magically know.
3. It doesn't fix your mental model. If you're still thinking in terms of "log statements," you'll just emit bad telemetry in a standardized format.
[Interactive demo: The OTel Reality Check]
OpenTelemetry is a delivery mechanism. It doesn't know that user-789 is a premium customer who's been with you for 3 years and just tried to spend $160. You have to tell it.
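Whether you use the OpenTelemetry SDK or anything else, business context only exists if your code attaches it. A sketch against a minimal span-like interface (the attribute names are my own convention; in real code the span would come from something like OTel's `trace.getActiveSpan()`):

```typescript
// The minimal surface we need; OTel spans expose a compatible setAttribute.
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): void;
}

function enrichWithBusinessContext(
  span: SpanLike,
  user: { id: string; subscription: string; accountAgeDays: number },
  cartTotalCents: number,
): void {
  // None of this is captured automatically. You have to tell the span.
  span.setAttribute("user.id", user.id);
  span.setAttribute("user.subscription", user.subscription);
  span.setAttribute("user.account_age_days", user.accountAgeDays);
  span.setAttribute("cart.total_cents", cartTotalCents);
}
```

The delivery mechanism is standardized; the decision to attach the subscription tier and cart value is still yours.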
Here's the mental model shift that changes everything:
Instead of logging what your code is doing, log what happened to this request.
Stop thinking about logs as a debugging diary. Start thinking about them as a structured record of business events.
For each request, emit one wide event per service hop. This event should contain every piece of context that might be useful for debugging. Not just what went wrong, but the full picture of the request.
[Interactive demo: Build a Wide Event]
Here's what a wide event looks like in practice:
{
  "timestamp": "2025-01-15T10:23:45.612Z",
  "request_id": "req_8bf7ec2d",
  "trace_id": "abc123",
  "service": "checkout-service",
  "version": "2.4.1",
  "deployment_id": "deploy_789",
  "region": "us-east-1",
  "method": "POST",
  "path": "/api/checkout",
  "status_code": 500,
  "duration_ms": 1247,
  "user": {
    "id": "user_456",
    "subscription": "premium",
    "account_age_days": 847,
    "lifetime_value_cents": 284700
  },
  "cart": {
    "id": "cart_xyz",
    "item_count": 3,
    "total_cents": 15999,
    "coupon_applied": "SAVE20"
  },
  "payment": {
    "method": "card",
    "provider": "stripe",
    "latency_ms": 1089,
    "attempt": 3
  },
  "error": {
    "type": "PaymentError",
    "code": "card_declined",
    "message": "Card declined by issuer",
    "retriable": false,
    "stripe_decline_code": "insufficient_funds"
  },
  "feature_flags": {
    "new_checkout_flow": true,
    "express_payment": false
  }
}
One query for user_id = "user_456" and you instantly know: this is a premium customer with an 847-day-old account, their card was declined for insufficient funds on the third attempt, the payment provider accounted for 1,089 ms of the 1,247 ms total, and the new checkout flow was enabled.

No grep-ing. No guessing. No second search.
With wide events, you're not searching text anymore. You're querying structured data. The difference is night and day.
[Interactive demo: Query Playground]
This is the superpower of wide events combined with high-cardinality, high-dimensionality data. You're not searching logs anymore. You're running analytics on your production traffic.
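The kind of question you can now answer in one pass, "checkout failures for premium users, grouped by error code", looks like this when sketched over an in-memory batch. In production the same query shape would run in your event store; the field names follow the example event above:

```typescript
// Narrowed view of a wide event, just the fields this query touches.
type WideEvent = {
  path: string;
  status_code: number;
  user?: { subscription?: string };
  error?: { code?: string };
};

// "Show me checkout failures for premium users, grouped by error code."
function failuresByErrorCode(events: WideEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    if (e.path !== "/api/checkout") continue;
    if (e.status_code < 500) continue;
    if (e.user?.subscription !== "premium") continue;
    const code = e.error?.code ?? "unknown";
    counts.set(code, (counts.get(code) ?? 0) + 1);
  }
  return counts;
}
```

With string logs this is three greps and a spreadsheet; with structured wide events it's a filter and a group-by.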
Here's a practical implementation pattern. The key insight: build the event throughout the request lifecycle, then emit once at the end.
// middleware/wideEvent.ts
export function wideEventMiddleware() {
  return async (ctx, next) => {
    const startTime = Date.now();

    // Initialize the wide event with request context
    const event: Record<string, unknown> = {
      request_id: ctx.get('requestId'),
      timestamp: new Date().toISOString(),
      method: ctx.req.method,
      path: ctx.req.path,
      service: process.env.SERVICE_NAME,
      version: process.env.SERVICE_VERSION,
      deployment_id: process.env.DEPLOYMENT_ID,
      region: process.env.REGION,
    };

    // Make the event accessible to handlers
    ctx.set('wideEvent', event);

    try {
      await next();
      event.status_code = ctx.res.status;
      event.outcome = 'success';
    } catch (error) {
      const err = error as Error & { code?: string; retriable?: boolean };
      event.status_code = 500;
      event.outcome = 'error';
      event.error = {
        type: err.name,
        message: err.message,
        code: err.code,
        retriable: err.retriable ?? false,
      };
      throw error;
    } finally {
      event.duration_ms = Date.now() - startTime;
      // Emit the wide event, exactly once per request
      logger.info(event);
    }
  };
}
Then in your handlers, you enrich the event with business context:
app.post('/checkout', async (ctx) => {
  const event = ctx.get('wideEvent');
  const user = ctx.get('user');

  // Add user context
  event.user = {
    id: user.id,
    subscription: user.subscription,
    account_age_days: daysSince(user.createdAt),
    lifetime_value_cents: user.ltv,
  };

  // Add business context as you process
  const cart = await getCart(user.id);
  event.cart = {
    id: cart.id,
    item_count: cart.items.length,
    total_cents: cart.total,
    coupon_applied: cart.coupon?.code,
  };

  // Process payment
  const paymentStart = Date.now();
  const payment = await processPayment(cart, user);
  event.payment = {
    method: payment.method,
    provider: payment.provider,
    latency_ms: Date.now() - paymentStart,
    attempt: payment.attemptNumber,
  };

  // If payment fails, add error details
  if (payment.error) {
    event.error = {
      type: 'PaymentError',
      code: payment.error.code,
      stripe_decline_code: payment.error.declineCode,
    };
  }

  return ctx.json({ orderId: payment.orderId });
});
[Interactive demo: Wide Event Builder Simulator]
"But Boris," I hear you saying, "if I log 50 fields per request at 10,000 requests per second, my observability bill will bankrupt me."
Valid concern. This is where sampling comes in.
Sampling means keeping only a percentage of your events. Instead of storing 100% of traffic, you might store 10% or 1%. At scale, this is the only way to stay sane (and solvent).
But naive sampling is dangerous. If you randomly sample 1% of traffic, you might accidentally drop the one request that explains your outage.
[Interactive demo: The Sampling Trap]
Tail sampling means you make the sampling decision after the request completes, based on its outcome.
The rules are simple:
1. Always keep errors. 100% of 500s, exceptions, and failures get stored.
2. Always keep slow requests. Anything above your p99 latency threshold.
3. Always keep specific users. VIP customers, internal testing accounts, flagged sessions.
4. Randomly sample the rest. Happy, fast requests? Keep 1-5%.
This gives you the best of both worlds: manageable costs, but you never lose the events that matter.
// Tail sampling decision function
function shouldSample(event: WideEvent): boolean {
  // Always keep errors
  if (event.status_code >= 500) return true;
  if (event.error) return true;

  // Always keep slow requests (above p99)
  if (event.duration_ms > 2000) return true;

  // Always keep VIP users
  if (event.user?.subscription === 'enterprise') return true;

  // Always keep requests with specific feature flags (debugging rollouts)
  if (event.feature_flags?.new_checkout_flow) return true;

  // Randomly sample the rest at 5%
  return Math.random() < 0.05;
}

"Isn't this just structured logging?" No. Structured logging means your logs are JSON instead of strings. That's table stakes. Wide events are a philosophy: one comprehensive event per request, with all context attached. You can have structured logs that are still useless (5 fields, no user context, scattered across 20 log lines).
"Doesn't OpenTelemetry give me this already?" You're using a delivery mechanism. OpenTelemetry doesn't decide what to capture. You do. Most OTel implementations I've seen capture the bare minimum: span name, duration, status. That's not enough. You need to deliberately instrument with business context.
"How is this different from distributed tracing?" Tracing gives you request flow across services (which service called which). Wide events give you context within a service. They're complementary. Ideally, your wide events ARE your trace spans, enriched with all the context you need.
"Aren't logs for debugging and metrics for dashboards?" This distinction is artificial and harmful. Wide events can power both. Query them for debugging. Aggregate them for dashboards. The data is the same, just different views.
"Isn't high-cardinality data expensive?" It's expensive on legacy logging systems built for low-cardinality string search. Modern columnar databases (ClickHouse, BigQuery, etc.) are specifically designed for high-cardinality, high-dimensionality data. The tooling has caught up. Your practices should too.
When you implement wide events properly, debugging transforms from archaeology to analytics.
Instead of: "The user said checkout failed. Let me grep through 50 services and hope I find something."
You get: "Show me all checkout failures for premium users in the last hour where the new checkout flow was enabled, grouped by error code."
One query. Sub-second results. Root cause identified.
Your logs stop lying to you. They start telling the truth. The whole truth.
Complete the form below to get a personalized report on your stack. I'll tell you what's working, what's not, and where you can save money. I genuinely want to hear about your logging nightmares :)
That was difficult to read, smelt very AI assisted though the message was worthwhile, it could've been shorter and more to the point.
A few things I've been thinking about recently:
- we have authentication everywhere in our stack, so I've started including the user id on every log line. This makes getting a holistic view of what a user experienced much easier.
- logging an error as a separate log line to the request log is a pain. You can filter for the trace, but it makes it hard to surface "show me all the logs for 5xx requests and the error associated" - it's doable, but it's more difficult than filtering on the status code of the request log
- it's not enough to just start including that context, you have to educate your coworkers that it's now present. I've seen people making life hard for themselves because they didn't realize we'd added this context
On the other hand, investing in better tracing tools unlocks a whole nother level of logging and debugging capabilities that aren't feasible with just request logs. It's kind of like you mentioned with using the user id as a "trace" in your first message but on steroids.
These tools tend to be very expensive in my experience unless you are running your own monitoring cloud. Either you end up sampling traces at low rates to save on costs, or your observability bill is more than your infrastructure bill.
We self host Grafana Tempo and whilst the cost isn’t negligible (at 50k spans per second), the money saved in developer time when debugging an error, compared to having to sift through and connect logs, is easily an order of magnitude higher.
Doing stuff like turning on tracing for clients that saw errors in the last 2 minutes, or for requests that were retried should only gather a small portion of your data. Maybe you can include other sessions/requests at random if you want to have a baseline to compare against.
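The idea above, tracing clients that errored recently or requests that were retried, plus a small random baseline, can be sketched as a decision helper (the names and the two-minute window are my own, matching the comment's example):

```typescript
const WINDOW_MS = 2 * 60 * 1000; // "errors in the last 2 minutes"
const lastErrorAt = new Map<string, number>(); // clientId -> timestamp

// Call this whenever a client's request fails.
function recordError(clientId: string, now: number = Date.now()): void {
  lastErrorAt.set(clientId, now);
}

// Trace if the client errored recently, the request was retried,
// or it falls into a small random baseline for comparison.
function shouldTrace(
  clientId: string,
  wasRetried: boolean,
  now: number = Date.now(),
  baselineRate = 0.01,
): boolean {
  const last = lastErrorAt.get(clientId);
  if (last !== undefined && now - last <= WINDOW_MS) return true;
  if (wasRetried) return true;
  return Math.random() < baselineRate;
}
```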
Try open-source databases specially designed for traces, such as Grafana Tempo or VictoriaTraces. They can handle the data ingestion rate of hundreds of thousands trace spans per second on a regular laptop.
I like to write my own in every company I'm in, using bash. So I have a local set of bash commands to help me figure out logs and colorize the items I want to.
Takes some time and it's a pain in the ass initially, but once I've matured them, work becomes so much easier. Reduces dependence on other people / teams / access as well.
Edit: Thinking about this, they won't work in other use cases. I'm a data engineer, so my jobs are mostly sequential.
I've tried HyperDX and SigNoz, they seem easy to self-host and decent enough
If your codebase has the concept of a request ID, you could also feasibly use that to trace what a user has been doing with more specificity.
…and the same ID can be displayed to user on HTTP 500 with the support contact, making life of everyone much easier.
I have seen pushback on this kind of behavior because "users don't like error codes" or other such nonsense. UX and Product like to pretend nothing will ever break, and when it does they want some funny little image, not useful output.
A good compromise is to log whenever a user would see the error code, and treat those events with very high priority.
We put the error code behind a kind of message/dialog that invites the user to contact us if the problem persists and then report that code.
It’s my long standing wish to be able to link traces/errors automatically to callers when they call the helpdesk. We have all the required information. It’s just that the helpdesk has actually very little use for this level of detail. So they can only attach it to the ticket so that actual application teams don’t have to search for it.
> I have seen pushback on this kind of behavior because "users don't like error codes" or other such nonsense […]
There are two dimensions to it: UX and security.
Displaying excessive technical information on an end-user interface will complicate support and likely reveal too much about the internal system design, making it vulnerable to external attacks.
The latter is particularly concerning for any design facing the public internet. A frequently recommended approach is exception shielding. It involves logging two messages upon encountering a problem: a nondescript user-facing message (potentially including a reference ID pinpointing the problem in space and time) and a detailed internal message with the problem’s details and context for L3 support / engineering.
Sorry for the OT response, I was curious about this comment[0] you made a while back. How did you measure memory transfer speed?
I used «powermetrics» bundled with macOS with «bandwidth» as one of the samplers (--samplers / -s set to «cpu_power,gpu_power,thermal,bandwidth»).
Unfortunately, Apple has taken out the «bandwidth» sampler from «powermetrics», and it is no longer possible to measure the memory bandwidth as easily.
> UX and Product like to pretend nothing will ever break, and when it does they want some funny little image, not useful output.
Just ignore them or provide appeasement insofar that it doesn’t mess with your ability to maintain the system.
(cat picture or something)
Oh no, something went wrong.
Please don’t hesitate to reach out to our support: (details)
This code will better help us understand what happened: (request or trace ID)

Nah, that's an easy problem to solve with UX copy. „Something went wrong. Try again or contact support. Your support request number is XXXX XXXX" (base 58 version of UUID).
We do have both a span id and trace id, but I personally find this more cumbersome than filtering on a user id. YMMV: if you're interested in a single trace then you'd filter for that, but I find you often also care what happened "around" a trace.
If you care about this more than anything else (e.g. if you care about audits a LOT and need them perfect), you can simply code the app via action paths, rather than for modularity. It makes changes harder down the road, but for codebases that don’t change much, this can be a viable tradeoff to significantly improve tracing and logging.
...if it does not, you should add it. A request ID, trace ID, correlation key, whatever you call it, you should thread it through every remote call, if you value your sanity.
TIDs are good here too. If you generate it and enforce it across all your services spanning various teams and APIs anyone of any team can grab a TID you provide and you can get the full end to end of one transaction.
Wow, I didn't think this was badly written at all! I certainly don't think it smells like AI. Are you conflating lists with AI written prose?
> - we have authentication everywhere in our stack, so I've started including the user id on every log line. This makes getting a holistic view of what a user experienced much easier.
Depends on the service, but tracking everything a user does may not be an option in terms of data retention laws
> That was difficult to read, smelt very AI assisted though the message was worthwhile...
It won’t be long before ad computem comments like this are frowned upon.
Why? "This was written badly" is a perfectly normal thing to say; "this was written badly because you didn't put in the effort of writing it yourself" doubly so.
Say they used AI to write it, it came out bad, and they published it anyway. They had the opportunity to "make it better" before publishing, but didn't. The only conclusion for this is, they just aren't good at writing. So whether AI is used or not, it'll suck either way. So there's no need to complain about the AI.
It's like complaining that somebody typed a crappy letter rather than hand-wrote it. Either way the letter's gonna suck, so why complain that it was typed?
Compared to human bad writing, AI writing tends to suck more verbosely and in exciting new ways (e.g. by introducing factual errors).
> AI writing tends to suck more verbosely
So, it's the style you oppose, the way a grammar nazi complains about "improper" English
> and in exciting new ways (e.g. by introducing factual errors).
Because factually incorrect comments didn't exist before AI?
Your concern is that you read something you don't like, so you pick the lowest-effort criteria to complain about. Speaks more about you than the original commenter.
I'm pretty sure by verbose it's the realization you've wasted precious time reading AI bloat that you'll never get back. On top of that, now you need to reread the text for hallucinations or just take a loss and ignore any conclusions at risk that they came from bad data.
> The only conclusion for this is, they just aren't good at writing.
Not true. It's likely an effort issue in that situation.
And that kind of effort issue is good to call out, because it compounds the low quality.
I don't know if you're new to the internet, but low-effort comments have existed before AI, and will continue to exist regardless of AI.
I read it as a more-or-less kind comment: “even though you’ll notice that they let an AI make the writing terrible, the underlying point is good enough to be worth struggling through that and discussing”
I felt unsure whether to include that particular comment, but landed on including because I think it's a real danger. I've got no problem with people using AI and do use it for some things myself.
However I don't think you should outsource understanding to LLMs, and also think that shifting the effort from the writer to the reader is a poor strategy (and disrespectful to the reader)
edit: in case it's unclear I'm not accusing the author of having outsourced their understanding to AI, but I think it's a real risk that people can fall into, the value is in the thinking people put into things not the mechanics of typing it out
A post on this topic feels incomplete without a shout-out to Charity Majors - she has been preaching this for a decade, branded the term "wide events" and "observability", and built honeycomb.io around this concept.
Also worth pointing out that you can implement this method with a lot of tools these days. Both structured Logs or Traces lend itself to capture wide events. Just make sure to use a tool that supports general query patterns and has rich visualizations (time-series, histograms).
> she has been preaching this for a decade, branded the term "wide events" and "observability",
With all due respect to her other work, she most certainly did not coin the term “observability”. Observability has been a topic in multiple fields for a very long time and has had widespread usage in computing for decades.
I’m sure you meant well by your comment, but I doubt this is a claim she even makes for herself.
She has been an influential writer on the topic and founded a company in this space, but she didn’t actually create the concept or terminology of observability.
> A post on this topic feels incomplete without a shout-out to Charity Majors
I concur. In fact, I strongly recommend anyone who has been working with observability tools or in the industry to read her blog, and the back story that lead to honeycomb. They were the first to recognize the value of this type of observability and have been a huge inspiration for many that came after.
Could you drop a few specific posts here that you think are good for someone (me) who hasn't read her stuff before? Looks like there's a decade of stuff on her blog and I'm not sure I want to start at the very beginning...
A few of my favourites:
- Software Sprawl, The Golden Path, and Scaling Teams With Agency: https://charity.wtf/2018/12/02/software-sprawl-the-golden-pa... - introduces the idea of the "golden path", where you tell engineers at your company that if they use the approved stack of e.g. PostgreSQL + Django + Redis then the ops team will support that for them, but if they want to go off path and use something like MongoDB they can do that but they'll be on the hook for ops themselves.
- Generative AI is not going to build your engineering team for you: https://stackoverflow.blog/2024/12/31/generative-ai-is-not-g... - why generative AI doesn't mean you should stop hiring junior programmers.
- I test in prod: https://increment.com/testing/i-test-in-production/ - on how modern distributed systems WILL have errors that only show up in production, hence why you need to have great instrumentation in place. "No pull request should ever be accepted unless the engineer can answer the question, “How will I know if this breaks?”"
- Advice for Engineering Managers Who Want to Climb the Ladder: https://charity.wtf/2022/06/13/advice-for-engineering-manage...
- The Engineer/Manager Pendulum: https://charity.wtf/2017/05/11/the-engineer-manager-pendulum... - I LOVE this one, it's about how it's OK to have a career where you swing back and forth between engineering management and being an "IC".
The one on Generative AI seems a bit outdated. This was before Claude Code was released.
Most of that one still rings very true to me. I particularly liked this section:
> Let’s start here: hiring engineers is not a process of “picking the best person for the job”. Hiring engineers is about composing teams. The smallest unit of software ownership is not the individual, it’s the team. Only teams can own, build, and maintain a corpus of software. It is inherently a collaborative, cooperative activity.
I totally agree with this part.
Right now, we are in a transitioning phase, where parts of a team might reject the notion of using AI, while others might be using it wisely, and still others might be auto-creating PRs without checking the output. These misalignments are a big problem in my view, and it’s hard to know (for anybody involved) during hiring what the stance really is because the latter group is often not honest about it.
Terrific, thank you.
Honeycomb is inspired by Facebook's Scuba (https://research.facebook.com/publications/scuba-diving-into...). The paper is from 2013, predating honeycomb. Charity worked there as well, but presumably was not part of the initial implementation given the timing.
I've learned more from Charity about telemetry than from anyone else. Her book is great, as are her talks and blog posts. And Honeycomb, as a tool, is frankly pretty amazing
Yep, I'm a fan.
> They were the first to recognize the value of this type of observability
With all due respect to her great writing, I think there’s a mix of revisionist history blended with PR claims going on in this thread. The blog has some good reading, but let’s not get ahead of ourselves in rewriting history around this one person/company.
> I think there’s a mix of revisionist history blended with PR claims going on in this thread.
I can only speak for myself. I worked for a company that is somewhere in the observability space (Sentry) and Charity was a person I looked up to my entire time working on Sentry. Both for how she ran the company, for the design they picked and for the approaches they took. There might be others that have worked on wide events (afterall, Honeycomb is famously inspired by Facebook's scuba), she is for sure the voice that made it popular.
This post was so in line with her writing that I was really expecting it to turn into an ad for Honeycomb at the end. I was pretty surprised when it turned out the author was unaffiliated!
Nick Blumhardt has been preaching this for a while longer than that, as "structured logging", with Seq and Serilog as the enabling software and library in the .NET ecosystem.
The article emphasizes that their recommendation is different from structured logging.
She has good content but no single person branded the term "observability", what the heck. You can respect someone without making wild claims.
While I agree with some of it, I feel like there's a big gotcha here that isn't addressed. Having 1 single wide event, at the end of a request, means that if something unexpected happens in the middle (stack overflow, some bug that throws an error that bypasses your logging system, lambda times out etc...) you don't get any visibility into what happens.
You also most likely lose out on a lot of logging frameworks your language has that your dependencies might use.
I would say this is a good layer to put on top of your regular logs. Make sure you have a request/session wide id and aggregate all those in your clickhouse or whatever into a single "log".
The way I have solved for this in my own framework in PHP is by having a Logging class with the following interface
interface LoggerInterface {
    // Calls $this->system(LEVEL_ERROR, ...);
    public function exception(Throwable $e): void;

    // Typical system logs
    public function system(string $level, string $message, ?string $category = null, mixed ...$extra): void;

    // User-specific logs that can be seen in the user's "my history"
    public function log(string $event, int|string|null $user_id = null, ?string $category = null, ?string $message = null, mixed ...$extra): void;
}
I also have a global exception handler, registered at application bootstrap time, that takes any exception that happens at runtime and runs $logger->exception($e);

There is obviously a tiny bit more boilerplate to this to ensure reliability, but it works so well that I can't live without it anymore. The logs are then inserted into a wide DB table with all the fields one could ever want to examine, thanks to the variadic parameter.
Nice. I guess you write logs in the "finally" block of a global try/catch/finally?
Something like:
try {
    // handle request code
} catch (...) {
    // add exceptions to log
} finally {
    // insert logs into DB
}

I used to do it like that and it worked really well, but I changed the flow so that exceptions are actually part of the control flow of the app, using PHP's set_exception_handler(), set_error_handler() and register_shutdown_function().
Example: let's say a user forgot to provide a password when authenticating. Then I will throw a ClientSideException(400, "need password yada yada");
That exception will bubble up to the exception_handler that logs and outputs the proper message to the screen. Similarly if ANY exception is thrown, regardless of where it originated, the same will happen.
When you embrace exceptions as control flow rather than try to avoid them, suddenly everything gets 10x easier and I end up writing much less code overall.
I love Exceptions as control flow! Thanks for the suggestion.
I too use specialized exceptions. Some have friendly messages that can be displayed to the user, like "Please fill the password". But critical exceptions display a generic error to the user ("Ops, sorry something went wrong on our side...") but log specifics to devs, like database connection errors, for example.
If that's an issue (visibility into middle layers) it just means your events aren't wide enough. There's no fundamental difference between log.error(data) and wide_event.attach(error, data), nor similar schemes using parameters rather that context/global-based state.
There are still use cases for one or the other strategy, but I'm not a fan of this explanation in either direction.
> If that's an issue (visibility into middle layers) it just means your events aren't wide enough.
I hate this kind of No-True-Scotsman handwaves for how a certain approach is supposed to solve my problems. "If brute-force search is not solving all your problems, it just means your EC2 servers are not beefy enough."
I gotta admit, I don't quite "get" TFA's point. The one issue that jumped out at me while reading it and your comment is that sooner or later your wide events just become fat, supposedly-still-human-readable JSON dumps.
I think a machine-parseable log-line format is still better than wide events, each line hopefully correlated with a request id though in practice I find that user id + time correlation isn't that bad either.
>> [TFA] Wide Event: A single, context-rich log event emitted per request per service. Instead of 13 log lines for one request, you emit 1 line with 50+ fields containing everything you might need to debug.
I am not convinced this is supposed to help the poor soul who has to debug an incident at 2AM. Take for example a function that has to watch out for a special kind of user (`isUserFoo`) where "special kind" is defined as a metric on five user attributes. I.e.,
isUserFoo = lambda u: u.isA and (u.isB or u.isC) and (u.isD or u.isE)
With usual logging I might find <timestamp> : <function> : {"level": "INFO", "requestID": "xxxaaa", "msg": "user is foo"}
Which immediately tells me that foo-ness is something I might want to pay attention to in this context. With wide events, as I understand it, either you log the user in the wide event dump with attributes A to E (and potentially more!) or you coalesce these into a boolean field `isUserFoo`. Neither of which tells me that foo-ness might be relevant in this context.
Multiply that with all the possible special-cases any logging unit might have to deal with. There's bar-ness which is also dependent on attributes A-E but with different logical connectives. There's baz-ness which is `isUserFoo(u) XOR (217828 < u.zCount < 3141592)`. The wide event is soooo context-rich I'm drowning.
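Spelling out the complaint with toy data (field names are made up): the classic log line names the derived concept at the point it was computed, while the wide event carries either the raw inputs or an opaque boolean.

```python
def isUserFoo(u: dict) -> bool:
    # "Special kind of user" as a metric on five attributes.
    return u["isA"] and (u["isB"] or u["isC"]) and (u["isD"] or u["isE"])

user = {"isA": True, "isB": True, "isC": False, "isD": True, "isE": False}

# Classic logging: the derived fact is stated where it mattered.
classic_line = {"level": "INFO", "requestID": "xxxaaa", "msg": "user is foo"}

# Wide event, option 1: dump the raw attributes; the reader must re-derive foo-ness.
wide_raw = {"requestID": "xxxaaa", **user}

# Wide event, option 2: coalesce into a boolean -- one field among 50+.
wide_coalesced = {"requestID": "xxxaaa", "isUserFoo": isUserFoo(user)}
```

Neither wide-event option signals that foo-ness was the thing the code branched on at this call site, which is the point of the objection.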
Your objection, as I understand it, is some combination of "no true Scotsman" combined with complaints about wide events themselves.
To the first point (no true Scotsman), I really don't think that applies in the slightest. The post I'm replying to said (paraphrasing) that middle-layer observability is hard with wide events and easy with logs. My counter is that the objection has nothing to do with wide events vs logs, since in both scenarios you can choose to include or omit more information with the same syntactic (and similar runtime overhead) ease. I think that's qualitatively different from other NTS arguments like TDD, in that their complaint is "I don't have enough information if I don't send it somewhere" and my objection is just "have you tried sending it somewhere?" There isn't an infinite stream of counter-arguments about holding the tool wrong; there's the very dumbest suggestion a rubber duck might possibly provide about their particular complaint, which is fully and easily compatible with wide events in every incarnation I've seen.
To the second point (wide events aren't especially useful and/or suck), I think a proper argument for them is a bit more nuanced (and I agree that they aren't a panacea and aren't without their drawbacks). I'll devote the rest of my (hopefully brief) comment to this idea.
1. Your counter-example falls prey to the same flaw as the post I responded to. If you want information then just send that information somewhere. Wide events don't stop you from gathering data you care about. If you need a requestID then it likely exists in the event already. If you need a message then _hopefully_ that's reasonably encoded in your choice of sub-field, and if it's not then you're free to tack on that sort of metadata as well.
2. Your next objection is about the wide event being so context-rich as to be a problem in its own right. I'm sympathetic to that issue, but normal logging doesn't avoid it. It takes exactly one production issue where you can't tie together related events (or else can tie them together but only via hacks which sometimes merge unrelated events with similar strings) for you to realize that completely disjoint log lines aren't exactly a safe fallback. If you have so much context-dependent complexity that a wide event is hard to interpret then linear logs are going to be a pain in the ass as well.
Mildly addressing the _actual_ pros and cons: logs and wide events are both capable of transmitting the same information. One reasonable frame of reference is viewing wide events as "pre-joined", with a side helping of compiler enforcement of the structure.
It's trivially possible to produce two log lines in unrelated parts of a code base which no possible parser can disambiguate. That's not (usually) possible when you have some data specification (your wide event) mediating the madness.
It's sometimes possible with normal logs to join on things which matter (as in your requestID example), but it's always possible with wide events since the relevant joins are executed by construction (also substantially more cheaply than a post-hoc join). Moreover, when you have to sub-sample, wide events give an easy strategy for ensuring your joins work (you sub-sample at a wide event level rather than a log-line level) -- it's not required; I've worked on systems with a "log seed" or whatever which manage that joinability in the face of sub-sampling, but it's more likely to "just work" with wide events.
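A sketch of the sub-sampling point: if the keep/drop decision is deterministic per request, everything for that request survives or vanishes together. The hash-based sampler below is illustrative, not from any particular system; it's one way to implement the "log seed" scheme mentioned above.

```python
import hashlib

def keep(request_id: str, rate: float) -> bool:
    # Deterministic per-request decision: hash the request ID into [0, 1)
    # and compare against the sampling rate.
    h = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    return (h % 10_000) / 10_000 < rate

# Wide events get this for free: one keep/drop decision per event == per request.
# Log lines get it only if *every* emitter carries request_id and uses the same
# sampler -- which is exactly the coordination that tends to break post-hoc joins.
```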
The real argument in favor of wide events, IMO, is that it encourages returning information a caller is likely to care about at every level of the stack. You don't just get potentially slightly better logs; you're able to leverage the information in better tests and other hooks into the system in question. Parsing logs for every little one-off task sucks, and systems designed to be treated that way tend to suck and be nearly impossible to modify as desired if you actually have to interact with "logs" programmatically.
It's still just one design choice in a sea of other tradeoffs (and one I'm only half-heartedly pursuing at $WORK since we definitely have some constraints which are solved by neither wide events nor traditional logging), BUT:
1. My argument against some random person's choice of counter-argument is perfectly sound. Nothing they said depended on wide events in the slightest, which was my core complaint, and I'm very mildly offended that anyone capable of writing something as otherwise sane and structured as your response would think otherwise.
2. Wide events do have a purpose, and your response doesn't seem to recognize that point in the design space. TFA wasn't the most enjoyable thing I've ever read, but I don't think the core ideas were that opaque, and I don't think a moment's consideration of carry-on implications would be out of the question either. I could be very wrong about the requisite background to understand the article or something, but I'm surprised to see responses of any nature which engage with irrelevant minutiae rather than the subset of core benefits TFA chose to highlight (and I'm even more surprised to see anything in favor of or against wide events given my stated position that I care more about the faulty argument against them than whether they're good or bad).
I wonder if one might solve this by using an accumulator that merges objects as they are emitted based on some ID (a request ID, say) and then either emits the object on normal execution or has a global exception handler emit it on error...?
I was going to say that. That definitely would be a solution (and ought to be the way it works).
Having logs in the format "connection X:Y accepted at Z ns for http request XXX" and then a "connection X:Y closed at Z ns for http response XXX" is rather nice when debugging issues on slow systems.