
WebSockets cost us $1M on our AWS bill

https://www.recall.ai/post/how-websockets-cost-us-1m-on-our-aws-bill
Is this really an AWS issue? Sounds like you were just burning CPU cycles, which is not AWS related. WebSockets makes it sound like it was a data transfer or API gateway cost.
>In a typical TCP/IP network connected via ethernet, the standard MTU (Maximum Transmission Unit) is 1500 bytes, resulting in a TCP MSS (Maximum Segment Size) of 1448 bytes. This is much smaller than our 3MB+ raw video frames.

> Even the theoretical maximum size of a TCP/IP packet, 64k, is much smaller than the data we need to send, so there's no way for us to use TCP/IP without suffering from fragmentation.

Just highlights that they do not have enough technical knowledge in-house. They should spend the $1M/year savings on hiring some good devs.

Love the transparency here. Would also love if the same transparency was applied to pricing for their core product. Doesn't appear anywhere on the site.
It’s ok, it’s now a million dollars/year cheaper when your renewal comes up!

Jokes aside though, some good performance sleuthing there.

Masking in the WebSocket protocol is kind of a funny and sad fix to the problem of intermediaries trying to be smart and helpful, but failing miserably.

The linked section of the RFC is worth the read: https://www.rfc-editor.org/rfc/rfc6455#section-10.3

Why were they using websockets to send video in the first place?

Was it because they didn't want to use some multicast video server?

This is such a weird way to do things.

Here they have a nicely compressed stream of video data, so they take that stream and... decode it. But they aren't processing the decoded data at the source of the decode, so instead they forward that decoded data, uncompressed(!!), to a different location for processing. Surprisingly, they find out that moving uncompressed video data from one location to another is expensive. So, they compress it later (Don't worry, using a GPU!)

At so many levels this is just WTF. Why not forward the compressed video stream? Why not decompress it where you are processing it instead of in the browser? Why are you writing it out without any attempt at compression? Even if you want lossless compression, there are well-known and fast codecs like FFV1 for that purpose.

Just weird.

Why decode, only to turn around and re-encode?
Reading their product page, it seems like Recall captures meetings on whatever platform their customers are using: Zoom, Teams, Google Meet, etc.

Since they don't have API access to all these platforms, the best they can do to capture the A/V streams is simply to join the meeting in a headless browser on a server, then capture the browser's output and re-encode it.

I had the same question, but I imagine that the "media pipeline" box with a line that goes directly from "compositor" to "encoder" is probably hiding quite a lot of complexity

Recall's offering allows you to get "audio, video, transcripts, and metadata" from video calls -- again, total conjecture, but I imagine they do need to decode into raw format in order to split out all these end-products (and then re-encode for a video recording specifically.)

My guess is either that the video they get uses some proprietary encoding format (the JS might do some magic on the feed), or that it's a latency-optimized stream that consumes a lot of bandwidth.
Did they consider iceoryx2? From the outside, it feels like it fits the bill.
The title makes it sound like there was some kind of blowout, but really it was a tool that wasn't the best fit for this job, and they were using twice as much CPU as necessary, nothing crazy.
> A single 1080p raw video frame would be 1080 * 1920 * 1.5 = 3110.4 KB in size

They seem to not understand the fundamentals of what they're working on.
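
For anyone wondering where the 1.5 comes from: YUV 4:2:0 stores a full-resolution luma plane plus two quarter-resolution chroma planes, i.e. 1.5 bytes per pixel, so the quoted figure itself checks out. A quick back-of-the-envelope sketch (mine, not from the article):

    #include <stdio.h>

    int main(void) {
        long pixels = 1920L * 1080L;                 /* 2,073,600 pixels        */
        double frame_bytes = pixels * 1.5;           /* YUV 4:2:0, 1.5 B/pixel  */
        printf("%.1f KB/frame\n", frame_bytes / 1000.0);          /* ~3110.4 KB */
        printf("%.1f MB/s at 30 fps\n", frame_bytes * 30 / 1e6);  /* ~93 MB/s   */
        return 0;
    }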

> Chromium's WebSocket implementation, and the WebSocket spec in general, create some especially bad performance pitfalls.

You're doing bulk data transfers into a multiplexed short messaging socket. What exactly did you expect?

> However there's no standard interface for transporting data over shared memory.

Yes there is. It's called /dev/shm. You can use shared memory like a filesystem, and no, you should not be worried about user/kernel space overhead at this point. It's the obvious solution to your problem.
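
As a rough illustration of what that looks like (a minimal sketch, not Recall's actual interface; the name and size are made up, and older glibc needs -lrt):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* one raw 1080p frame in YUV 4:2:0, matching the article's ~3.1 MB figure */
    #define FRAME_BYTES (1920 * 1080 * 3 / 2)

    int main(void) {
        /* creates /dev/shm/frame_buf; the consumer opens the same name and mmaps
           it, so frames cross the process boundary without a per-frame socket copy */
        int fd = shm_open("/frame_buf", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, FRAME_BYTES) < 0) { perror("shm"); return 1; }

        unsigned char *frame = mmap(NULL, FRAME_BYTES, PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);
        if (frame == MAP_FAILED) { perror("mmap"); return 1; }

        memset(frame, 0x80, FRAME_BYTES);   /* "write" one grey frame */

        munmap(frame, FRAME_BYTES);
        close(fd);
        shm_unlink("/frame_buf");           /* remove /dev/shm/frame_buf when done */
        return 0;
    }

Synchronization of who owns which slot still has to be layered on top, which is what a ring buffer over that memory gives you.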

> Instead of the typical two-pointers, we have three pointers in our ring buffer:

You can use two back to back mmap(2) calls to create a ringbuffer which avoids this.
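
Sketch of that mirror-mapping trick, in case "two back to back mmap(2) calls" isn't obvious; memfd_create is Linux-specific and everything here is illustrative:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t size = 1 << 20;                /* buffer size, must be page-aligned */
        int fd = memfd_create("ring", 0);     /* anonymous shared-memory fd        */
        if (fd < 0 || ftruncate(fd, (off_t)size) < 0) { perror("memfd"); return 1; }

        /* reserve 2*size of contiguous address space... */
        unsigned char *base = mmap(NULL, 2 * size, PROT_NONE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) { perror("reserve"); return 1; }

        /* ...then map the same fd into both halves */
        if (mmap(base, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED ||
            mmap(base + size, size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) {
            perror("mmap"); return 1;
        }

        /* a write that crosses the end of the buffer wraps around "for free":
           bytes past the end land at the start via the second mapping */
        memcpy(base + size - 8, "0123456789ABCDEF", 16);
        printf("%c %c\n", base[0], base[7]);  /* prints: 8 F */

        munmap(base, 2 * size);
        close(fd);
        return 0;
    }

Because the second view aliases the first, a producer or consumer never has to split a frame at the wrap point; head and tail indices just advance modulo the buffer size.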

I don't mean to be dismissive, but this would have been caught very early on (in the planning stages) by anyone that had/has experience in system-level development rather than full-stack web js/python development. Quite an expensive lesson for them to learn, even though I'm assuming they do have the talent somewhere on the team if they're able to maintain a fork of Chromium.

(I also wouldn't be surprised if they had even more memory copies than they let on, marshalling between the GC-backed JS runtime to the GC-backed Python runtime.)

I was coming back to HN to include in my comment a link to various high-performance IPC libraries, but another commenter already beat me linking to iceoryx2 (though of course they'd need to use a python extension).

SHM for IPC has been well understood as the better option for high-bandwidth payloads since the 1990s, and is a staple of Win32 application development for communication between services (daemons) and clients (GUIs).

Sometimes it is more important to work on proving you have a viable product and market to sell it in before you optimise.

On the outside we can’t be sure. But it’s possible that they took the right decision to go with a naïve implementation first. Then profile, measure and improve later.

But yes, the whole idea of running a headless web browser to run JavaScript just to get access to a video stream is a bit crazy. But I guess that's just the world we are in.

    > I don't mean to be dismissive, but this would have been caught very early on (in the planning stages) by anyone that had/has experience in system-level development rather than full-stack web js/python development
Based on their job listing[0], Recall is using Rust on the backend.

[0] https://www.workatastartup.com/companies/recall-ai

It's not even clear why they need a browser in the mix; most of these services have APIs you can use. (Also, why fork Chromium instead of using CEF?)
Wouldn't something like Redis also be an alternative?
FWIW: The MTU of the loopback interface on Linux is 64KB by default
Actual reality beyond the fake title:

"using WebSockets over loopback was ultimately costing us $1M/year in AWS spend"

then

"and the quest for an efficient high-bandwidth, low-latency IPC"

Shared memory. It has been there for 50 years.

{"deleted":true,"id":42068330,"parent":42067275,"time":1730922863,"type":"comment"}
They are presumably using the GPU for video encoding....

And the GPU for rendering...

So they should instead just be hooking into Chromium's GPU process and grabbing the pre-composited tiles from the LayerTreeHostImpl[1] and dealing with those.

[1]: https://source.chromium.org/chromium/chromium/src/+/main:cc/...

Did they originally NOT run things on the same machine? Otherwise the WebSocket would be local and incur no cost.
>WebSocket would be local and incur no cost.

The memory copies are the cost that they were paying, even if it was local.

Our WebSocket traffic is roughly 40% of recall.ai's, and our bill was $150 USD this month using a high-memory VPS.
Did you read the article? It is about the CPU cost of using WebSockets to transfer data over loopback.
Classic Hacker News getting hung up on the narrative framing. It’s a cool investigation! Nice work guys!
How much did the engineering time to make this optimization cost?
That's a good write-up, with a standard solution from some other spaces. Shared memory buffers are very fast; it's interesting to see them being used here. It wasn't what I expected, which was that they were doing something dumb with API Gateway WebSockets. This is actual stuff. Nice.
I for one would like to praise the company for sharing their failure; hopefully the next time someone Googles "transport video over websocket" they'll find this thread.
{"deleted":true,"id":42068457,"parent":42067275,"time":1730923391,"type":"comment"}
What was the actual cost? CPU?
> But it turns out that if you IPC 1TB of video per second on AWS it can result in enormous bills when done inefficiently.

As a point of comparison, how many TB per second of video does Netflix stream?

> But it turns out that if you IPC 1TB of video per second on AWS it can result in enormous bills when done inefficiently.

That's surprising to... almost no one? 1 TB/s is nothing to scoff at.

Could Arrow be a part of the shared memory solution in another context?
I've been toying around with a design for a real-time chat protocol, and was recently in a debate of WebSockets vs HTTP long polling. This should give me some nice ammunition.
No, this story is about interprocess communication on a single computer, it has practically nothing to do with WebSockets vs something else over an IP network.