
WebSockets cost us $1M on our AWS bill

https://www.recall.ai/post/how-websockets-cost-us-1m-on-our-aws-bill
Is this really an AWS issue? Sounds like you were just burning CPU cycles, which is not AWS related. WebSockets makes it sound like it was a data transfer or API gateway cost.
>In a typical TCP/IP network connected via ethernet, the standard MTU (Maximum Transmission Unit) is 1500 bytes, resulting in a TCP MSS (Maximum Segment Size) of 1448 bytes. This is much smaller than our 3MB+ raw video frames.

> Even the theoretical maximum size of a TCP/IP packet, 64k, is much smaller than the data we need to send, so there's no way for us to use TCP/IP without suffering from fragmentation.

Just highlights that they do not have enough technical knowledge in-house. They should spend the $1M/year savings on hiring some good devs.

Love the transparency here. Would also love if the same transparency was applied to pricing for their core product. Doesn't appear anywhere on the site.
It’s ok, it’s now a million dollars/year cheaper when your renewal comes up!

Jokes aside though, some good performance sleuthing there.

Masking in the WebSocket protocol is kind of a funny and sad fix to the problem of intermediaries trying to be smart and helpful, but failing miserably.

The linked section of the RFC is worth the read: https://www.rfc-editor.org/rfc/rfc6455#section-10.3

Why were they using websockets to send video in the first place?

Was it because they didn't want to use some multicast video server?

This is such a weird way to do things.

Here they have a nicely compressed stream of video data, so they take that stream and... decode it. But they aren't processing the decoded data at the source of the decode, so instead they forward that decoded data, uncompressed(!!), to a different location for processing. Surprisingly, they find out that moving uncompressed video data from one location to another is expensive. So, they compress it later (Don't worry, using a GPU!)

At so many levels this is just WTF. Why not forward the compressed video stream? Why not decompress it where you are processing it instead of in the browser? Why are you writing it out without any attempt at compression? Even if you want lossless compression, there are well-known and fast codecs like FFV1 for that purpose.

Just weird.

Why decode, only to turn around and re-encode?
Reading their product page, it seems like Recall captures meetings on whatever platform their customers are using: Zoom, Teams, Google Meet, etc.

Since they don't have API access to all these platforms, the best they can do to capture the A/V streams is simply to join the meeting in a headless browser on a server, then capture the browser's output and re-encode it.

I had the same question, but I imagine that the "media pipeline" box with a line that goes directly from "compositor" to "encoder" is probably hiding quite a lot of complexity

Recall's offering allows you to get "audio, video, transcripts, and metadata" from video calls -- again, total conjecture, but I imagine they do need to decode into raw format in order to split out all these end-products (and then re-encode for a video recording specifically.)

My guess is either that the video they get uses some proprietary encoding format (the JS might do some magic on the feed), or that it's a latency-optimized stream that consumes a lot of bandwidth.
Did they consider iceoryx2? From the outside, it feels like it fits the bill.
The title makes it sound like there was some kind of blowout, but really it was a tool that wasn't the best fit for this job, and they were using twice as much CPU as necessary, nothing crazy.
> A single 1080p raw video frame would be 1080 * 1920 * 1.5 = 3110.4 KB in size

They seem to not understand the fundamentals of what they're working on.
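
For anyone wondering where the 1.5 comes from: YUV 4:2:0 stores a full-resolution luma plane plus two quarter-resolution chroma planes, i.e. 1.5 bytes per pixel, so the quoted figure itself checks out. A quick back-of-the-envelope sketch (mine, not from the article):

    #include <stdio.h>

    int main(void) {
        long pixels = 1920L * 1080L;                 /* 2,073,600 pixels        */
        double frame_bytes = pixels * 1.5;           /* YUV 4:2:0, 1.5 B/pixel  */
        printf("%.1f KB/frame\n", frame_bytes / 1000.0);          /* ~3110.4 KB */
        printf("%.1f MB/s at 30 fps\n", frame_bytes * 30 / 1e6);  /* ~93 MB/s   */
        return 0;
    }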

> Chromium's WebSocket implementation, and the WebSocket spec in general, create some especially bad performance pitfalls.

You're doing bulk data transfers into a multiplexed short messaging socket. What exactly did you expect?

> However there's no standard interface for transporting data over shared memory.

Yes there is. It's called /dev/shm. You can use shared memory like a filesystem, and no, you should not be worried about user/kernel space overhead at this point. It's the obvious solution to your problem.
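
As a rough illustration of what that looks like (a minimal sketch, not Recall's actual interface; the name and size are made up, and older glibc needs -lrt):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* one raw 1080p frame in YUV 4:2:0, matching the article's ~3.1 MB figure */
    #define FRAME_BYTES (1920 * 1080 * 3 / 2)

    int main(void) {
        /* creates /dev/shm/frame_buf; the consumer opens the same name and mmaps
           it, so frames cross the process boundary without a per-frame socket copy */
        int fd = shm_open("/frame_buf", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, FRAME_BYTES) < 0) { perror("shm"); return 1; }

        unsigned char *frame = mmap(NULL, FRAME_BYTES, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
        if (frame == MAP_FAILED) { perror("mmap"); return 1; }

        memset(frame, 0x80, FRAME_BYTES);   /* "write" one grey frame */

        munmap(frame, FRAME_BYTES);
        close(fd);
        shm_unlink("/frame_buf");           /* remove /dev/shm/frame_buf when done */
        return 0;
    }

Synchronization of who owns which slot still has to be layered on top, which is what a ring buffer over that memory gives you.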

> Instead of the typical two-pointers, we have three pointers in our ring buffer:

You can use two back to back mmap(2) calls to create a ringbuffer which avoids this.
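
Sketch of that mirror-mapping trick, in case "two back to back mmap(2) calls" isn't obvious; memfd_create is Linux-specific and everything here is illustrative:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t size = 1 << 20;                /* buffer size, must be page-aligned */
        int fd = memfd_create("ring", 0);     /* anonymous shared-memory fd        */
        if (fd < 0 || ftruncate(fd, (off_t)size) < 0) { perror("memfd"); return 1; }

        /* reserve 2*size of contiguous address space... */
        unsigned char *base = mmap(NULL, 2 * size, PROT_NONE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) { perror("reserve"); return 1; }

        /* ...then map the same fd into both halves */
        if (mmap(base, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED ||
            mmap(base + size, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
            perror("mmap"); return 1;
        }

        /* a write that crosses the end of the buffer wraps around "for free":
           bytes past the end land at the start via the second mapping */
        memcpy(base + size - 8, "0123456789ABCDEF", 16);
        printf("%c %c\n", base[0], base[7]);  /* prints: 8 F */

        munmap(base, 2 * size);
        close(fd);
        return 0;
    }

Because the second view aliases the first, a producer or consumer never has to split a frame at the wrap point; head and tail indices just advance modulo the buffer size.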

I don't mean to be dismissive, but this would have been caught very early on (in the planning stages) by anyone that had/has experience in system-level development rather than full-stack web js/python development. Quite an expensive lesson for them to learn, even though I'm assuming they do have the talent somewhere on the team if they're able to maintain a fork of Chromium.

(I also wouldn't be surprised if they had even more memory copies than they let on, marshalling between the GC-backed JS runtime to the GC-backed Python runtime.)

I was coming back to HN to include in my comment a link to various high-performance IPC libraries, but another commenter already beat me linking to iceoryx2 (though of course they'd need to use a python extension).

SHM for IPC has been well understood as the better option for high-bandwidth payloads since the 1990s, and is a staple of Win32 application development for communication between services (daemons) and clients (GUIs).

Sometimes it is more important to work on proving you have a viable product and market to sell it in before you optimise.

On the outside we can’t be sure. But it’s possible that they took the right decision to go with a naïve implementation first. Then profile, measure and improve later.

But yes, the whole idea of running a headless web browser to run JavaScript just to get access to a video stream is a bit crazy. But I guess that's just the world we are in.

    > I don't mean to be dismissive, but this would have been caught very early on (in the planning stages) by anyone that had/has experience in system-level development rather than full-stack web js/python development
Based on their job listing[0], Recall is using Rust on the backend.

[0] https://www.workatastartup.com/companies/recall-ai

It's not even clear why they need a browser in the mix; most of these services have APIs you can use. (Also, why fork Chromium instead of using CEF?)
Wouldn't something like Redis also be an alternative?
FWIW: The MTU of the loopback interface on Linux is 64KB by default
Actual reality beyond the fake title:

"using WebSockets over loopback was ultimately costing us $1M/year in AWS spend"

then

"and the quest for an efficient high-bandwidth, low-latency IPC"

Shared memory. It has been there for 50 years.

{"deleted":true,"id":42068330,"parent":42067275,"time":1730922863,"type":"comment"}
They are presumably using the GPU for video encoding....

And the GPU for rendering...

So they should instead just be hooking into Chromium's GPU process and grabbing the pre-composited tiles from the LayerTreeHostImpl[1] and dealing with those.

[1]: https://source.chromium.org/chromium/chromium/src/+/main:cc/...

Did they originally NOT run things on the same machine? Otherwise the WebSocket would be local and incur no cost.
>WebSocket would be local and incur no cost.

The memory copies are the cost that they were paying, even if it was local.

Our WebSocket traffic is roughly 40% of recall.ai's, and our bill was $150 USD this month using a high-memory VPS.
Did you read the article? It is about the CPU cost of using WebSockets to transfer data over loopback.
Classic Hacker News getting hung up on the narrative framing. It’s a cool investigation! Nice work guys!
How much did the engineering time to make this optimization cost?
That's a good write-up, with a standard solution from some other spaces. Shared memory buffers are very fast; it's interesting to see them being used here. It wasn't what I expected, which was that they were doing something dumb with API Gateway WebSockets. This is actual stuff. Nice.
I for one would like to praise the company for sharing their failure; hopefully the next time someone Googles "transport video over websocket" they'll find this thread.
{"deleted":true,"id":42068457,"parent":42067275,"time":1730923391,"type":"comment"}
What was the actual cost? CPU?
> But it turns out that if you IPC 1TB of video per second on AWS it can result in enormous bills when done inefficiently.

As a point of comparison, how many TB per second of video does Netflix stream?

> But it turns out that if you IPC 1TB of video per second on AWS it can result in enormous bills when done inefficiently.

That's surprising to... almost no one? 1 TB/s is nothing to scoff at.

Could Arrow be a part of the shared memory solution in another context?
I've been toying around with a design for a real-time chat protocol, and was recently in a debate of WebSockets vs HTTP long polling. This should give me some nice ammunition.
No, this story is about interprocess communication on a single computer, it has practically nothing to do with WebSockets vs something else over an IP network.