> I'll have to monitor more things (like system upgrades and intrusion attempts)
You very much should be monitoring / managing those things on AWS as well. For system upgrades, `unattended-upgrades` can keep security patches (or anything else if you'd like, though I wouldn't recommend that unless you have a canary instance) up to date for you. For kernel upgrades, historically that's meant reboots, though there has been a smattering of live-patching tools like Ksplice, kGraft, and the latest addition from GEICO of all places, tuxtape [0].
> I'd also have to amortize parts and labor as part of the cost, which is going to push the price up.
Given the prices you laid out for AWS, it's not multi-AZ, but even single-AZ can of course fail over with some downtime. So I'll say you get 2U with two individual servers, the DBs either doing logical replication w/ failover, or something like DRBD [1] presenting the two servers' storage as a single block device (you'd still need a failover mechanism for the DBs). Call it $400 for two 1U servers, and maybe $150/month at most for colo space. Even against the (IMO unrealistically low) $200/month quote for AWS, you're saving $50/month, so the hardware pays for itself in about 8 months. Re: parts and labor, luckily, parts for old servers are incredibly cheap. PC3-12800R 16GiB sticks are $10-12. CPUs are also stupidly cheap. Assuming Ivy Bridge era (yes, this is old; yes, it's still plenty fast for nearly any web app), even the fastest available (E5-2697v2) is $50 for a matched pair.
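If you want to sanity-check that break-even claim, the arithmetic is trivial (the numbers below are just the ones from this thread; swap in your own):

```python
# Back-of-envelope break-even: colo (hardware up front + monthly rent)
# vs. the quoted AWS bill. All figures are the ones quoted above.
hardware_upfront = 400        # two used 1U servers
colo_monthly = 150            # colo space, high estimate
aws_monthly = 200             # the (IMO low) AWS quote

monthly_savings = aws_monthly - colo_monthly              # $50/month
break_even_months = hardware_upfront / monthly_savings    # 8 months

for month in (6, 12, 24, 36):
    colo_total = hardware_upfront + colo_monthly * month
    aws_total = aws_monthly * month
    print(f"month {month:>2}: colo ${colo_total:>5} vs AWS ${aws_total:>5}")
```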
I don't say all of this just guessing; I run 3x Dell R620s along with 2x Supermicros in my homelab. My uptime for services is better than at most places I've worked (of course, I'm the only one doing the work, I get that). They run 24/7/365, and in the ~5 years or so I've had them, the only trouble the Dells have given me is one bad PSU (each server has redundant PSUs, so no big deal) and a couple of bad sticks of RAM. One Supermicro has been slightly less reliable, but to be fair, (a) it has a hodgepodge of parts, and (b) I modded its BIOS to allow NVMe booting, so it's not entirely SM's fault.
EDIT: re: backups in your other comment, run ZFS as your filesystem (for a variety of reasons), periodically snapshot, and then send those off-site to any number of block storage providers. Keep the last few days, with increasing granularity as you approach today, on the servers as well. If you need to roll back, it's incredibly fast to do so.
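For the curious, the flow is roughly this; a minimal sketch wrapping the `zfs` CLI from Python, where the dataset name, remote host, and target dataset are all made-up placeholders:

```python
# Sketch of the snapshot-and-ship flow described above. Dataset, remote
# target, and naming scheme are placeholders, not a recommendation.
import subprocess
from datetime import datetime

DATASET = "tank/data"                    # hypothetical local dataset
REMOTE = "backup@offsite.example.com"    # hypothetical off-site box
REMOTE_DATASET = "backup/data"           # hypothetical dataset on the remote

def take_snapshot() -> str:
    name = f"{DATASET}@auto-{datetime.now():%Y%m%d-%H%M%S}"
    subprocess.run(["zfs", "snapshot", name], check=True)
    return name

def send_incremental(prev: str, curr: str) -> None:
    # zfs send -i <prev> <curr> | ssh <remote> zfs receive <dataset>
    send = subprocess.Popen(["zfs", "send", "-i", prev, curr],
                            stdout=subprocess.PIPE)
    subprocess.run(["ssh", REMOTE, "zfs", "receive", "-F", REMOTE_DATASET],
                   stdin=send.stdout, check=True)
    send.wait()
```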
But you don't need comparable capacity, at least not at first. And when you do, you click some buttons or run `terraform plan`/`apply`. Measured purely on tech specs, it's absolutely going to cost more. But you're not paying primarily for tech specs; you're paying for somebody else to do the work. That's where the cost comparison really needs to be made.
Security in AWS is a thorny topic, I'll agree, but the risks are a little different. You need to secure your accounts and users, and lock down unneeded services while monitoring for unexpected service utilization. Honestly, I think for what you're paying, AWS should be doing more for you here (they are improving, albeit slowly). Hence maybe the real point of comparison ought to be a PaaS, since then all of that is out of scope too, and I think such offerings are already putting pressure on AWS to offer more value.
Agreed.
> But you're not paying primarily for tech specs, you're paying for somebody else to do the work. ... Honestly, I think for what you're paying, AWS should be doing more for you here
Also agreed, and this is why I don't think the value proposition exists.
We can agree to disagree on which approach is better; I doubt there's an objective truth to be had.
This is why the numbers don't stack up in these calculations – the premise that the DB has to be a provisioned instance is the wrong one to start from.
The right way of cooking RDS in AWS is to go serverless from the start and configure the ACU range, e.g. 1 to N. That way it will be even cheaper than the originally quoted $200.
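For illustration, the scaling range is just a property on the cluster; a rough boto3 sketch (identifiers, credential handling, and the 1–16 ACU bounds are placeholders, not a recommendation):

```python
# Sketch: an Aurora Serverless v2 cluster with a 1-16 ACU scaling range.
# Identifiers and the bounds are made up for illustration.
import boto3

rds = boto3.client("rds")

rds.create_db_cluster(
    DBClusterIdentifier="app-db",
    Engine="aurora-postgresql",
    MasterUsername="postgres",
    ManageMasterUserPassword=True,   # let Secrets Manager hold the password
    ServerlessV2ScalingConfiguration={"MinCapacity": 1.0, "MaxCapacity": 16.0},
)

# Serverless v2 capacity is attached to an instance of class "db.serverless".
rds.create_db_instance(
    DBInstanceIdentifier="app-db-writer",
    DBClusterIdentifier="app-db",
    DBInstanceClass="db.serverless",
    Engine="aurora-postgresql",
)
```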
Generally speaking, there is no need for anything in AWS to be provisioned at a fixed compute capacity unless there is a very specific use case or an edge case that warrants a provisioned instance of something.
Nitpick, but there is no Serverless for RDS, only Aurora. The two are wildly different in their architecture and performance characteristics. Then there's RDS Multi-AZ Cluster, which is about as confusingly named as they could manage, but I digress.
Let's take your stated minimum of 1 ACU as an example. That gives you 2 GiB of RAM, with "CPU and networking similar to what is available in provisioned Aurora instances." Since I can't find anything more specific, I'll compare it to a `t4g.small`, which has 2 vCPU (since it's ARM, those are actual cores, not threads), and 0.128 / 5.0 Gbps [0] baseline/burst network bandwidth, which is 16 / 625 MBps. That burst is best-effort, and also only lasts for 5 – 60 minutes [1] "depending on instance size." Since this is tiny, I'm going to assume the low end of that scale. Also, since this is Aurora, we have to account for both [2] client <--> DB and DB-compute (each node, if more than one) <--> DB-storage bandwidth. Aurora Serverless v2 is $0.12/ACU-hour, or $87.60/month at 1 ACU, plus storage, bandwidth, and I/O costs.
So we have a Postgres-compatible DB with 2 CPUs, 2 GiB of RAM, and 128 Mbps of baseline network bandwidth that's shared between application queries and the cluster volume. Since Aurora doesn't use the OS page cache, its `shared_buffers` will be set to ~75% of RAM, or 1.5 GiB. Memory is also consumed by the various background processes – the WAL writer, background writer, autovacuum daemon – and of course each connection spawns its own process. For that last reason, unless you're operating at toy scale (single-digit connections at any given time), you need some kind of connection pooler with Postgres. Keeping in the spirit of letting AWS do everything, they have RDS Proxy, which despite the name also works with Aurora. That's $0.015/ACU-hour, with a minimum of 8 ACUs for Aurora Serverless, or $87.60/month.
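Putting rough numbers on the above (730 hours/month, prices as quoted in this thread; an estimate, not a bill):

```python
# Rough monthly cost arithmetic for the minimal Aurora Serverless v2 setup
# described above, using the prices quoted in this thread.
HOURS_PER_MONTH = 730

acu_hourly = 0.12             # Aurora Serverless v2, per ACU-hour
min_acus = 1
aurora_compute = acu_hourly * min_acus * HOURS_PER_MONTH              # $87.60

proxy_hourly_per_acu = 0.015  # RDS Proxy, per ACU-hour
proxy_min_acus = 8            # minimum billed ACUs for Aurora Serverless
rds_proxy = proxy_hourly_per_acu * proxy_min_acus * HOURS_PER_MONTH   # $87.60

baseline_gbps = 0.128         # t4g.small-ish baseline network bandwidth
baseline_MBps = baseline_gbps * 1000 / 8   # ~16 MB/s, shared with storage traffic

print(f"Aurora compute: ${aurora_compute:.2f}/mo, RDS Proxy: ${rds_proxy:.2f}/mo")
print(f"Baseline bandwidth: {baseline_MBps:.0f} MB/s, before cluster-volume traffic")
```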
Now, you could of course just let Aurora scale up in response to network utilization, and skip RDS Proxy. You'll eventually bottleneck / it won't make any financial sense, but you could. I have no idea how to model that pricing, since it depends on so many factors.
I went on about network bandwidth so much because it catches people by surprise, especially with Aurora, and doubly so with Postgres. The reason is Postgres's WAL amplification from full-page writes [3]. If you have a UUIDv4 (or any other non-k-sortable) PK, the B+tree gets thrashed constantly, leading to slower reads and writes. Aurora doesn't suffer from the full-page-writes problem (it's still worth reading about and understanding), but it does still have the same index-thrashing problem, and it has the same issues as Postgres with Heap-Only Tuple updates [4]. Unless you've carefully designed your schema around this, it's going to affect you, and you'll have more network traffic than you expected. Add to that devs' love of chucking everything into JSON[B] columns, and the tuples are going to be quite large.
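A toy way to see the non-k-sortable problem (this only illustrates insert positions in a sorted keyspace, not how Postgres actually manages pages):

```python
# Toy illustration of why non-k-sortable keys (e.g. UUIDv4) thrash a B+tree:
# each new key lands at a random position in the sorted order, so inserts
# dirty pages all over the index instead of just the rightmost edge.
import bisect
import uuid

index = []          # stand-in for the sorted key order inside the index
positions = []
for _ in range(10_000):
    key = uuid.uuid4().hex
    pos = bisect.bisect(index, key)        # where this key would land
    positions.append(pos / (len(index) or 1))
    bisect.insort(index, key)

print(f"uuid4: mean relative insert position {sum(positions)/len(positions):.2f}")
# Sequential keys (bigserial, UUIDv7, ULID, ...) would always insert at ~1.0,
# i.e. append at the right-hand edge, keeping writes localized.
```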
Anyway, I threw together an estimate [5] with just Aurora (1 ACU, no RDS Proxy, modest I/O), 2x ALBs with an absurdly low consumption, and 2x ECS tasks. It came out to $232.52/month.
[0]: https://docs.aws.amazon.com/ec2/latest/instancetypes/gp.html...
[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
[2]: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...
[3]: https://www.rockdata.net/tutorial/tune-full-page-writes/
[4]: https://www.postgresql.org/docs/current/storage-hot.html
[5]: https://calculator.aws/#/estimate?id=8972061e6386602efdc2844...
Aurora is actually not a database but a scalable storage layer that operates over the network and is decoupled from the query engine (compute). The architecture has been used to implement vastly different query engines on top of it (PgSQL, MySQL, DocumentDB – a MongoDB alternative, and Neptune – a property graph database / triple store).
The closest abstraction I can think of to describe Aurora is a VAX/VMS cluster – where the consumer sees a single entity, regardless of size, whilst the scaling (out or back in) remains entirely opaque.
Aurora does not support RDS Proxy for PostgreSQL or its equivalents for other query engine types because it addresses cluster access through cluster endpoints. There are two types of endpoints: one for read-only queries (the «reader endpoint» in Aurora parlance) and one for read-mutate queries (the «writer endpoint»). Aurora supports up to 15 read replicas behind the reader endpoint, but there can be only one writer endpoint.
The reader endpoint improves the performance of non-mutating queries by distributing the load across the read replicas. The default Aurora cluster endpoint always points to the writer instance. Consumers can either send everything to the writer endpoint or segregate non-mutating queries onto the reader endpoint for faster execution.
This behaviour is consistent across all supported query engines, such as PostgreSQL, Neptune, and DocumentDB.
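To make the split concrete, a minimal sketch of routing reads and writes to the two endpoints (hostnames, credentials, and the table are made up):

```python
# Minimal sketch of segregating reads and writes across Aurora's endpoints.
# Hostnames, credentials, and schema are placeholders.
import psycopg2

WRITER = "app-db.cluster-abc123.eu-west-1.rds.amazonaws.com"     # writer endpoint
READER = "app-db.cluster-ro-abc123.eu-west-1.rds.amazonaws.com"  # reader endpoint

def connect(host: str):
    return psycopg2.connect(host=host, dbname="app", user="app", password="...")

# Mutating statements go to the single writer endpoint...
with connect(WRITER) as conn, conn.cursor() as cur:
    cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = %s", (1,))

# ...while read-only queries can fan out across replicas via the reader endpoint.
with connect(READER) as conn, conn.cursor() as cur:
    cur.execute("SELECT balance FROM accounts WHERE id = %s", (1,))
    print(cur.fetchone())
```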
I do not think it is correct to state that Aurora does not use the OS page cache – it does, as there is still a server with an operating system somewhere, despite the «serverless» moniker. In fact, due to its layered distributed architecture, there is now more than one OS page cache, as described in [0].
Since Aurora is only accessible over the network, it has its own peculiarities; the usual assumptions that come with local storage do not apply.
Now, onto the subject of costs. A couple of years ago, an internal client who ran provisioned RDS clusters in three environments (dev, uat, and prod) reached out to me with a request to create infrastructure clones of all three clusters. After analysing their data access patterns, peak times, and other relevant performance metrics, I concluded that they did not need provisioned RDS and would benefit from Aurora Serverless instead – which is exactly what they got (unbeknownst to them, which I consider another net positive for Aurora). The dev and uat environments were configured with lower maximum ACUs, whilst production got a higher maximum ACU, as expected.
Switching to Aurora Serverless resulted in a 30% reduction in the monthly bill for the dev and uat environments right off the bat and nearly a 50% reduction in production costs compared to a provisioned RDS cluster of the same capacity (if we use the upper ACU value as the ceiling). No code changes were required, and the transition was seamless.
Ironically, I have discovered that the AWS cost calculator consistently overestimates the projected costs, and the actual monthly costs are consistently lower. The cost calculator provides a rough estimate, which is highly useful for presenting the solution cost estimate to FinOps or executives. Unintentionally, it also offers an opportunity to revisit the same individuals later and inform them that the actual costs are lower. It is quite amusing.
[0] https://muratbuffalo.blogspot.com/2024/07/understanding-perf...
They call it [0] a database engine, and go on to say that "Aurora includes a high-performance storage subsystem":
> "Amazon Aurora (Aurora) is a fully managed relational database engine that's compatible with MySQL and PostgreSQL."
Though, somewhat to your point, they do also say that it's "part of RDS."
> The architecture has been used to implement vastly different query engines on top of it (PgSQL, MySQL, DocumentDB – a MongoDB alternative, and Neptune – a property graph database / triple store).
Do you have a source for this? That's new information to me.
> Aurora does not support RDS Proxy for PostgreSQL
Yes it does [1].
> I do not think it is correct to state that Aurora does not use the OS page cache – it does
It does not [2]:
> "Conversely, in Amazon Aurora PostgreSQL, the default value [for shared_buffers] is derived from the formula SUM(DBInstanceClassMemory/12038, -50003). This difference stems from the fact that Amazon Aurora PostgreSQL does not depend on the operating system for data caching." [emphasis mine]
Even without that explicit statement, you could infer it from the fact that the default value for `effective_cache_size` in Aurora Postgres is the same as that of `shared_buffers`, the formula given above.
> Switching to Aurora Serverless resulted in a 30% reduction in the monthly bill for the dev and uat environments right off the bat
Agreed, for lower-traffic clusters you can probably realize savings by doing this. However, it's also likely that for Dev/Stage/UAT environments, you could achieve the same or greater via an EventBridge rule that starts/stops the cluster such that it isn't running overnight (assuming the company doesn't have a globally distributed workforce).
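The handler behind such a rule is only a few lines; a sketch assuming a Lambda invoked by two scheduled EventBridge rules whose input sets an "action" field (the cluster identifier and payload shape are made up):

```python
# Sketch of a Lambda that EventBridge schedules could invoke in the evening
# ("stop") and morning ("start") for non-prod Aurora/RDS clusters.
# Cluster identifier and event payload shape are placeholders.
import boto3

rds = boto3.client("rds")
CLUSTER = "dev-app-db"

def handler(event, context):
    action = event.get("action")          # set on each EventBridge rule's input
    if action == "stop":
        rds.stop_db_cluster(DBClusterIdentifier=CLUSTER)
    elif action == "start":
        rds.start_db_cluster(DBClusterIdentifier=CLUSTER)
    return {"cluster": CLUSTER, "action": action}
```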
What bothers me most about Aurora's pricing model is charging for I/O. And yes, I know they have an alternative pricing model that doesn't (though the baseline is of course higher); it's the principle of the thing. The amortized cost of wear on the disks should be baked into the price. It would be difficult even for a skilled DBA with plenty of Linux experience to accurately estimate how many I/Os a given query might take. In a vacuum, for a cold cache, it's not that bad: estimate or look up statistics for row sizes, determine whether any predicates can use an index (and if so, the correlation of the column[s]), estimate index selectivity, if any, confirm expected disk block size vs. Postgres page size, and make an educated guess. If you add concurrent queries that may be altering the tuples you're reading, it's now much harder. If you then add a distributed storage layer, which I assume attempts to boxcar data blocks for transmission much like EBS does, it's nearly impossible. Now try doing that if you're a "cloud native" type who hasn't the faintest idea what blktrace [3] is.
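For a feel of what that cold-cache educated guess looks like, here's a back-of-envelope sketch; every input is an invented illustration, not a measurement:

```python
# Back-of-envelope, cold-cache I/O guess for a single index-scan query.
# All inputs are made up; real plans depend on statistics, correlation,
# fill factor, and what the storage layer coalesces on the wire.
PAGE_SIZE = 8192            # Postgres page size in bytes

rows_matched = 5_000        # rows the predicate is expected to return
avg_tuple_bytes = 300       # average heap tuple size (wide JSONB pushes this up)
correlation = 0.2           # 1.0 = heap ordered like the index, 0 = scattered

index_depth = 3             # root -> leaf descent
leaf_pages = rows_matched * 20 // PAGE_SIZE + 1   # ~20 bytes per index entry

rows_per_heap_page = PAGE_SIZE // avg_tuple_bytes
best_case_heap = rows_matched // rows_per_heap_page + 1   # perfectly clustered
worst_case_heap = rows_matched                            # one heap page per row
heap_pages = int(worst_case_heap - correlation * (worst_case_heap - best_case_heap))

total_pages = index_depth + leaf_pages + heap_pages
print(f"~{total_pages} page reads, ~{total_pages * PAGE_SIZE / 1e6:.1f} MB cold")
```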
[0]: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...
[1]: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...
[2]: https://aws.amazon.com/blogs/database/determining-the-optima...
My personal AWS account is stuffed with globally distributed multi-region, multi-az, fault tolerant, hugely scalable things that rarely get used. By “rarely” I mean requests per hour or minute, not second.
The sum total CPU utilization would be negligible. And if I ran instances across the 30+ AZs I’d be broke.
The service-based approach (aka event-driven) has some real magic at the low end of usage, where experimentation and innovation happen.