WebSockets cost us $1M on our AWS bill
https://www.recall.ai/post/how-websockets-cost-us-1m-on-our-aws-bill
> Even the theoretical maximum size of a TCP/IP packet, 64k, is much smaller than the data we need to send, so there's no way for us to use TCP/IP without suffering from fragmentation.
Just highlights that they do not have enough technical knowledge in house. Should spend the $1m/year saving on hiring some good devs.
Jokes aside though, some good performance sleuthing there.
The linked section of the RFC is worth the read: https://www.rfc-editor.org/rfc/rfc6455#section-10.3
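For anyone who doesn't want to click through: that section is the rationale for why client-to-server frames must be masked, and the transform itself (section 5.3 of the same RFC) is a per-byte XOR against a 4-byte key. A minimal sketch of it, assuming nothing about how Chromium actually implements it; `apply_mask` is just a name I made up:

```python
def apply_mask(payload: bytes, key: bytes) -> bytes:
    """XOR every payload byte with key[i % 4], as RFC 6455 section 5.3 describes.

    The same call masks and unmasks, because XOR is its own inverse.
    """
    if len(key) != 4:
        raise ValueError("masking key must be exactly 4 bytes")
    return bytes(b ^ key[i % 4] for i, b in enumerate(payload))


# Key and payload taken from the masked "Hello" example in RFC 6455 section 5.7.
key = bytes([0x37, 0xFA, 0x21, 0x3D])
masked = apply_mask(b"Hello", key)
unmasked = apply_mask(masked, key)
```

Cheap per byte, but it does touch every byte of every client-to-server frame, which is the kind of overhead that adds up on bulk transfers.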
Was it because they didn't want to use some multicast video server?
Here they have a nicely compressed stream of video data, so they take that stream and... decode it. But they aren't processing the decoded data at the source of the decode, so instead they forward that decoded data, uncompressed(!!), to a different location for processing. Surprisingly, they find out that moving uncompressed video data from one location to another is expensive. So, they compress it later (Don't worry, using a GPU!)
At so many levels this is just WTF. Why not forward the compressed video stream? Why not decompress it where you are processing it instead of in the browser? Why are you writing it without any attempt at compression? Even if you want lossless compression there are well-known and fast algorithms like FFV1 for that purpose.
Just weird.
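To put a rough number on "expensive", here is a back-of-the-envelope sketch. The resolution, frame rate, and pixel format are my assumptions (1080p, 30 fps, I420 at 12 bits per pixel), not figures quoted in this thread:

```python
def raw_i420_bytes_per_second(width: int, height: int, fps: int) -> int:
    """Bandwidth of an uncompressed I420 stream.

    I420 stores a full-resolution luma plane plus two quarter-resolution
    chroma planes, so each frame takes width * height * 3 / 2 bytes.
    """
    bytes_per_frame = width * height * 3 // 2
    return bytes_per_frame * fps


# Assumed stream parameters, for illustration only.
rate = raw_i420_bytes_per_second(1920, 1080, 30)
print(f"{rate / 1e6:.1f} MB/s per stream")
```

Under those assumptions that is about 93 MB/s per stream before any copies, which is a lot to push through a protocol designed for short messages.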
Since they don't have API access to all these platforms, the best they can do to capture the A/V streams is simply to join the meeting in a headless browser on a server, then capture the browser's output and re-encode it.
Recall's offering allows you to get "audio, video, transcripts, and metadata" from video calls -- again, total conjecture, but I imagine they do need to decode into raw format in order to split out all these end-products (and then re-encode for a video recording specifically.)
They seem to not understand the fundamentals of what they're working on.
> Chromium's WebSocket implementation, and the WebSocket spec in general, create some especially bad performance pitfalls.
You're doing bulk data transfers into a multiplexed short messaging socket. What exactly did you expect?
> However there's no standard interface for transporting data over shared memory.
Yes there is. It's called /dev/shm. You can use shared memory like a filesystem, and no, you should not be worried about user/kernel space overhead at this point. It's the obvious solution to your problem.
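A minimal sketch of what that can look like, using Python's standard `multiprocessing.shared_memory` module (on Linux the segment is created via POSIX shared memory and shows up under /dev/shm). `publish_frame` and `read_frame` are hypothetical helper names, and both ends run in one process here only to keep the example self-contained:

```python
from multiprocessing import shared_memory


def publish_frame(frame: bytes) -> shared_memory.SharedMemory:
    """Create a shared-memory segment and copy one frame into it."""
    shm = shared_memory.SharedMemory(create=True, size=len(frame))
    shm.buf[:len(frame)] = frame
    return shm  # the producer keeps this handle until the consumer is done


def read_frame(name: str, length: int) -> bytes:
    """Attach to an existing segment by name and copy the frame out."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return bytes(shm.buf[:length])
    finally:
        shm.close()


frame = bytes(range(256)) * 16          # stand-in for raw video data
producer = publish_frame(frame)
try:
    received = read_frame(producer.name, len(frame))
finally:
    producer.close()
    producer.unlink()                   # remove the segment from /dev/shm
```

A real consumer would work on `shm.buf` in place instead of copying into `bytes`, and you still need some signalling (a pipe, a futex, a semaphore) to say when a frame is ready.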
> Instead of the typical two-pointers, we have three pointers in our ring buffer:
You can use two back to back mmap(2) calls to create a ringbuffer which avoids this.
(I also wouldn't be surprised if they had even more memory copies than they let on, marshalling between the GC-backed JS runtime to the GC-backed Python runtime.)
I was coming back to HN to include in my comment a link to various high-performance IPC libraries, but another commenter already beat me to it by linking to iceoryx2 (though of course they'd need to use a Python extension).
SHM for IPC has been well understood as the better option for high-bandwidth payloads since the 1990s and is a staple of Win32 application development for communication between services (daemons) and clients (GUIs).
From the outside we can’t be sure. But it’s possible that they made the right decision to go with a naïve implementation first, then profile, measure and improve later.
But yes, the whole idea of running a headless web browser to run JavaScript to get access to a video stream is a bit crazy. But I guess that’s just the world we are in.
> I don't mean to be dismissive, but this would have been caught very early on (in the planning stages) by anyone that had/has experience in system-level development rather than full-stack web js/python development
Based on their job listing[0], Recall is using Rust on the backend.
"using WebSockets over loopback was ultimately costing us $1M/year in AWS spend"
then
"and the quest for an efficient high-bandwidth, low-latency IPC"
Shared memory. It has been there for 50 years.
And the GPU for rendering...
So they should instead just be hooking into Chromium's GPU process and grabbing the pre-composited tiles from the LayerTreeHostImpl[1] and dealing with those.
[1]: https://source.chromium.org/chromium/chromium/src/+/main:cc/...
The memcpy calls are the cost that they were paying, even if it was local.
As a point of comparison, how many TB per second of video does Netflix stream?
That’s surprising to... almost no one? 1 TB/s is nothing to scoff at.