Story Detail of id 47677706 | Liveview Hacker News

evil-olive 1 day ago | on: Running out of disk space in production

> Surely a 50% warning alarm on disk usage covers this without manual intervention?

surely you don't need a fire extinguisher in your kitchen, if you have a smoke detector?

a "warning alarm" is a terrible concept, in general. it's a perfect recipe for alert fatigue.

over time, you're likely to have someone silence the alarm because there's some host sitting at 57% disk usage for totally normal reasons and they're tired of getting spammed about it.

even well-tuned alert rules (ones that predict growth over time rather than only looking at the current value) tend to be targeted towards catching relatively "slow" leaks of disk usage.
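
as a rough sketch of what "predict growth over time" can mean: draw a straight line through recent usage samples and estimate how long until the disk is full. the function name and the sample format here are made up for illustration, and a real alerting system would likely use something more robust than a two-point slope.

```python
def seconds_until_full(samples, capacity_bytes):
    """Estimate seconds until the disk fills, from (timestamp, used_bytes) samples.

    Uses a straight line through the first and last sample. Returns None
    when usage is flat or shrinking, since no fill time can be projected.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    growth_per_sec = (used1 - used0) / (t1 - t0)
    if growth_per_sec <= 0:
        return None
    return (capacity_bytes - used1) / growth_per_sec
```

an alert would fire when the estimate drops below some lead time (say, a few hours), which is why this style of rule works for slow leaks and not for sudden ones.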

there is always the possibility of a "fast" disk space consumer filling up the disk more quickly than your alerting system can bring it to your attention and you can fix it. at the extreme end, for example, a standard EBS volume has a throughput of 125 MB/sec. something that saturates that limit will fill up 10 GB of free space in 80 seconds.
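
the 80-second figure is just free space divided by write throughput. a minimal check, assuming decimal units (1 MB = 10^6 bytes) and a hypothetical helper name:

```python
MB = 10**6
GB = 10**9

def seconds_to_fill(free_bytes, write_bytes_per_sec):
    """Time to exhaust free space at a constant write rate."""
    return free_bytes / write_bytes_per_sec

print(seconds_to_fill(10 * GB, 125 * MB))  # 80.0
```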

ssl-3 14 hours ago | parent | next

50% is probably unrealistic. Nobody really wants to diminish their storage by 50%.

Let's set a fixed threshold -- 100GB, say -- and play out both methods.

Method A: One or more ballast files are created, totalling 100GB. The machine runs out of storage and grinds to a halt. Hopefully someone notices soon or gets a generic alert that it has ceased, remembers that there are ballast files, and deletes one or more of them. They then poke it with a stick and get it going again, and set forth to resolve whatever was causing the no-storage condition (adding disk, cleaning trash, or whatever).
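
For concreteness, a ballast file for Method A could be created and released roughly like this. It is only a sketch with hypothetical function names; it writes real zero bytes rather than making a sparse file, on the assumption that the filesystem does not compress or deduplicate them away.

```python
import os

def create_ballast(path, size_bytes, chunk_bytes=1024 * 1024):
    """Write size_bytes of zeros to path, one chunk at a time."""
    chunk = b"\0" * chunk_bytes
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_bytes, remaining)
            f.write(chunk[:n])
            remaining -= n
        # Push the data to disk so the blocks are really in use.
        f.flush()
        os.fsync(f.fileno())

def release_ballast(path):
    """Delete the ballast file to hand its space back in an emergency."""
    os.remove(path)
```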

Method B: A specific alert that triggers with <100GB of free space. Someone sees this alert, understands what it means (because it is descriptive instead of generic), and logs in to resolve the low-storage condition (however that is done -- same as Method A). There is no stick-poking.
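
A minimal sketch of the Method B check, using Python's shutil.disk_usage. The function name, the message wording, and the injectable disk_usage parameter are illustrative assumptions; in practice this would usually be a rule in whatever monitoring system is already in place.

```python
import shutil

def low_space_alert(path, min_free_bytes, disk_usage=shutil.disk_usage):
    """Return a descriptive alert message if free space is below the threshold.

    Returns None when there is enough free space. disk_usage is injectable
    so the check can be exercised without touching a real filesystem.
    """
    free = disk_usage(path).free
    if free < min_free_bytes:
        return (
            f"low disk space on {path}: {free} bytes free, "
            f"threshold {min_free_bytes}"
        )
    return None
```

Calling low_space_alert("/", 100 * 10**9) would return a message only when the root filesystem has less than 100GB free.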

Method C: The control. We do nothing, and run out of space. Panic ensues. Articles are written.

---

Methods A and B raise the same number of alerts for each low-disk condition (<100GB). Both methods work, in that they can form the impetus to free up some space.

But Method A relies on the system crashing, while Method B does not rely upon a crash at all.

I think that the lack of crash makes Method B rather superior all on its own.

(Method C sucks.)

majormajor 12 hours ago | parent

How does the ballast file prevent extreme runaway? You ain't gonna notice and delete it that quickly.
