What we learned from a 22-Day storage bug (and how we fixed it)
https://www.mux.com/blog/22-day-storage-bug
> During this incident, we discovered we had crossed a scale threshold where our log ingestion pipeline was being rate-limited and quietly discarding logs. Ironically, we ended up with less information as a result, which made it significantly harder to reconstruct what was actually happening.
Last year they posted about using New Relic, Datadog, and Grafana. Would this ‘silent deletion of log data due to quota’ problem be characteristic of any one of them in particular, or is it something we have to watch out for with all of them?
“We didn’t handle errors, didn’t have logs, and now we do cuz next time.” Saved you a few mins.