Story Detail of id 48278856 | Liveview Hacker News

jaapz 9 hours ago | on: Incident with Actions and Pages

All these monitoring rules are of the format "when 500 errors > baseline for x minutes". Otherwise you'd have monitoring alerts every second. So it is normal for users to already see errors before github officially counts it as an outage.
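
A rule of the form "500 errors > baseline for x minutes" can be sketched as below. This is only an illustration of the idea, not GitHub's actual alerting: the class name `SustainedErrorAlert`, the 1% baseline, and the 3-minute window are all assumptions made for the example.

```python
from collections import deque


class SustainedErrorAlert:
    """Fires only once the 500 rate has exceeded the baseline
    for `window_minutes` consecutive one-minute samples."""

    def __init__(self, baseline_rate: float, window_minutes: int):
        self.baseline_rate = baseline_rate
        # Keeps only the most recent `window_minutes` breach flags.
        self.recent = deque(maxlen=window_minutes)

    def observe(self, errors: int, requests: int) -> bool:
        """Record one minute of traffic; return True if the alert fires."""
        rate = errors / requests if requests else 0.0
        self.recent.append(rate > self.baseline_rate)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(self.recent)


# Hypothetical per-minute (errors, requests) samples.
alert = SustainedErrorAlert(baseline_rate=0.01, window_minutes=3)
minutes = [(5, 1000), (50, 1000), (60, 1000), (70, 1000)]
fired = [alert.observe(e, r) for e, r in minutes]
print(fired)  # [False, False, False, True]
```

In this made-up sequence users start seeing elevated errors in the second minute, but the alert only fires in the fourth, which is the gap described above.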

logifail 7 hours ago | parent | next

> All these monitoring rules are of the format "when 500 errors > baseline for x minutes". Otherwise you'd have monitoring alerts every second. So it is normal for users to already see errors before github officially counts it as an outage.

Is it true that official service status pages are updated automatically?

hnlmorg 9 hours ago | parent | next

You'd expect them to be monitoring more than just the HTTP response codes from user requests for precisely this reason.

If the first they hear of an outage is when user requests start to fail, then that's a failure in their monitoring as well.

But effective monitoring is harder than people assume.

echelon 9 hours ago | parent | next

In a high performance service with good maintenance and upkeep, you page for all 500s. A noisy pager forces the team to fix the 500s.

Maybe the Github Actions infrastructure isn't run like that.

edit: my oncall rotation notified on all 500s, 24/7, not just rates - https://news.ycombinator.com/item?id=48279262

Doohickey-d 9 hours ago | root | parent | next

I'm curious about this, because in my experience (working on smaller services, though) a small number of errors is always there as a "baseline".

Recently there was this: https://news.ycombinator.com/item?id=47252971 "10% of Firefox crashes are caused by bitflips"

Which makes me think that a small number of random issues, which happen even though nothing is broken, is normal everywhere. Especially once you move things around on a network, there's potential for a lot more random errors.

bobthepanda 7 hours ago | root | parent | next

At that scale, monitoring for nines (availability targets) matters more than absolute error counts. So long as failures degrade gracefully or are retried, it should not be a massive problem.

It does require constant tuning and adjustment though.
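
As a rough sketch of what watching the nines instead of raw error counts can look like, the snippet below compares measured availability against a target. The function names, the 99.9% target, and the request counts are assumptions for illustration, not anything stated in the thread.

```python
def availability(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded (1.0 when there was no traffic)."""
    if total_requests == 0:
        return 1.0
    return (total_requests - failed_requests) / total_requests


def meets_slo(total_requests: int, failed_requests: int, slo: float) -> bool:
    """True while measured availability is at or above the target."""
    return availability(total_requests, failed_requests) >= slo


def budget_used(total_requests: int, failed_requests: int, slo: float) -> float:
    """Share of the error budget consumed; values above 1.0 mean it is exhausted."""
    allowed_failures = total_requests * (1.0 - slo)
    return failed_requests / allowed_failures


# Hypothetical day of traffic against a 99.9% ("three nines") target.
total = 1_000_000
print(meets_slo(total, 500, slo=0.999))    # 500 errors, still within target
print(meets_slo(total, 2_000, slo=0.999))  # 2,000 errors, target missed
print(round(budget_used(total, 500, slo=0.999), 2))
```

Here 500 failed requests is a large absolute number but uses only about half of the budget, so nobody would be paged for it under this scheme.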

KPGv2 9 hours ago | root | parent

Bitflips are something that can happen in consumer-grade RAM, so that tracks (and it's comforting that wayward cosmic rays are a substantial reason for an application's crashes!), but on enterprise servers, they will run ECC RAM that is very resistant to bit flips.

This is why data hoarders who have NASes with lots of space insist on running their servers with ECC RAM despite it being significantly more expensive. Because bit flips, for all intents and purposes, cannot happen. The RAM itself detects and corrects for them.

I wouldn't expect bit flips to be a significant contributor to enterprise problems.

TheDong 9 hours ago | root | parent | next

Do you know of a single service at a single company that actually does that?

I know all of Gmail, every GCE service I can think of, every AWS service I can think of, Amazon.com, Netflix, and Github all do not page on just a single 500.

I know none of those are particularly "high performance" though. Curious where your experience is coming from.

CBLT 9 hours ago | root | parent | next

I've been oncall for a different G service that nearly paged on every error. It used the standard error budget tooling, but on hundreds of user buckets because the engineering around locality-specific configuration was... suspect. Many of these buckets had single-digit user counts. If a user was on their phone and lost signal, I was paged. Very poor oncall experience.

theta_d 8 hours ago | root | parent | next

The sub-service at IBM cloud I worked on had an insanely small error budget such that pages were nearly constant. On call was hell week until a few of us insisted on fixing the issues. The "few" of us were contractors. The employees seemed more than willing to just let the pages continue.

echelon 9 hours ago | root | parent

I worked at a large fintech moving billions of dollars in volume a day.

I had a fairly long tenure, where I maintained multiple key services in the critical online payments flow: authentication, authorization, core business and risk data, as well as some cross-cutting control plane stuff, etc. You needed one or more of our services to take a payment or serve any request from the employee dashboard; pretty much everything hit our services. The entire company ground to a halt without my team.

We paged for every single 500. In instances where a particular class of 500 was spurious or not worth fixing, we would leave it acked or mark it as noise. But typically we'd just put in a fix as soon as possible so we didn't page.

Our graceful shutdown and traffic shaping stack was great, but occasionally we'd get a few pages during deploys or failovers.

Oncall was typically not bad, but when it did get bad it was terrible. I've been involved in huge outages that cost hundreds of millions of dollars. Usually it was the fault of multiple teams having compounding runaway failures rather than one service or bug in particular.

It's inexcusable to have a customer's payments not go through. We engineered around resilience. We had strict five nines SLAs and p99 targets and evaluated our adherence with even the smallest partial outage. Hundreds of other services depended on ours, and downstream impacts were huge, so we had to keep a tight ship.

We didn't have "business hours"-only paging either as our platform was available globally, including a heavy install base in Asia.

sunrunner 8 hours ago | root | parent

> We paged for every single 500.

Assuming the existence of some kind of network (with zero guarantee of 100% reliability), how does this work in practice? Is each 500 treated as an event that needs investigation, even if the result of that would end up as 'a router dropped something from an internal buffer but the transaction as a whole was re-tried by a parent so the service itself recovered'?

compumike 8 hours ago | root | parent | next

Re: "page for all 500s": there's a world of difference between "page me with a critical alert at 3am" and "notify me on Monday morning when my normal workday starts". At the extremes:

If my DB health check endpoint is returning 500s for N consecutive checks over M minutes, yeah, please wake me up at 3am!

If one user hit a weird edge case in form validation and got a one-off 500, please don't! We can fix that on Monday.

Not always easy to distinguish those clearly or configure those business hours rules, but for my team at https://heyoncall.com/ that is the goal -- otherwise your team burns out fast. Waking up someone at 3am has a real cost, so you better be sure it's worth it.
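
The two extremes above could be routed roughly like this. It is a minimal sketch under simplifying assumptions: the names `Action` and `route` are made up, the "N consecutive checks over M minutes" condition is reduced to a consecutive-failure count, and it does not reflect how any particular on-call product works.

```python
from enum import Enum


class Action(Enum):
    PAGE_NOW = "page_now"                    # wake someone up, even at 3am
    NEXT_BUSINESS_DAY = "next_business_day"  # queue it for normal working hours


def route(is_health_check: bool, consecutive_failures: int,
          page_after: int = 3) -> Action:
    """Page only for sustained health-check failures; defer everything else."""
    if is_health_check and consecutive_failures >= page_after:
        return Action.PAGE_NOW
    return Action.NEXT_BUSINESS_DAY


# DB health check failing three checks in a row: page immediately.
print(route(is_health_check=True, consecutive_failures=3))
# One-off 500 from a form-validation edge case: wait until Monday.
print(route(is_health_check=False, consecutive_failures=1))
```

Real rules usually need more inputs than this (time of day, affected users, error class), which is where the configuration gets hard.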

wasmitnetzen 8 hours ago | root | parent

Shouldn't Github be large enough to not have anyone on-call, but just rotate the responsible team around the world?

awithrow 9 hours ago | root | parent | next

That is absolutely not the case for any system of size and scale. It would just burn out the on-call team and not result in improvements. Error rates/budgets are used instead.

hnlmorg 8 hours ago | root | parent

It depends what you're monitoring. If it's response codes from user generated queries, then I'd agree with you.

But if it is synthetic queries sent from the monitoring platform, then you control the user agent, payload, and endpoints. So any failed requests are a symptom of a misconfiguration and/or failure that should be investigated. Albeit not necessarily as a P1 priority.

hvb2 8 hours ago | root | parent | next

> A noisy pager forces the team to fix the 500s.

I'm sure you're not in ops. Or in a dev org of a service with decent request rates.

What you're asking for is a service to fail silently. There's no way for a service with a decent request rate to have zero 500s. Not when it still sees development.

A 50 year old bank API? Maybe...

rhyperior 8 hours ago | root | parent | next

You only do this when you’re trying to use incident management as a hammer to make a point to somebody whom you have otherwise failed to convince to fix something through persuasive argument. I.e., it’s punitive.

swiftcoder 8 hours ago | root | parent | next

Yeah, no, nobody runs cloud services like that. At AWS most alarms required failures in 3 consecutive 5-minute periods. Critical things could be on 3 consecutive 1-minute windows, but that alarm starts a 15-minute escalation for the oncall engineer to check in, and they have to validate the issue isn't a false alarm before updating the status page would even be considered.
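
To put rough numbers on those windows, here is a back-of-the-envelope sketch. It assumes the alarm fires as soon as the last required period ends and that the escalation window is an upper bound on the engineer's check-in; both function names are invented for the example.

```python
def minutes_until_alarm(periods: int, period_minutes: int) -> int:
    """Minimum time a failure must persist before the alarm can fire."""
    return periods * period_minutes


def worst_case_minutes_to_check_in(periods: int, period_minutes: int,
                                   escalation_minutes: int) -> int:
    """Alarm delay plus the full escalation window for the oncall to respond."""
    return minutes_until_alarm(periods, period_minutes) + escalation_minutes


# "Most alarms": 3 consecutive 5-minute periods.
print(minutes_until_alarm(3, 5))                 # 15
# "Critical things": 3 consecutive 1-minute windows, then a 15-minute escalation.
print(worst_case_minutes_to_check_in(3, 1, 15))  # 18
```

Under those assumptions even a critical alarm can leave users seeing errors for many minutes before a human has validated anything, and the status page update comes after that.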

jordemort 9 hours ago | root | parent

forget it, Jake; it’s Azure

registeredcorn 6 hours ago | parent

I'm not arguing with what you're saying, but it does make me wonder: What exactly is the point of the status page, if "it is normal for users to already see errors before GitHub officially counts it as an outage"?

Is it more so that managers who aren't using the service have something to link to, a pretty bar to look at, and can feel like they are "doing something"? Or is it more of a way to avoid having to confirm for yourself what you already suspect to be true? E.g. "Huh. Me and Jim are seeing problems. How about you, Tom? Oh wait, crud. The status page is confirming it's down now. Never mind! Who wants coffee?!"
