Helix: A vision-language-action model for generalist humanoid control

https://www.figure.ai/news/helix

286Philpax | 1 day ago | 162 | HN

It seems that end to end neural networks for robotics are really taking off. Can someone point me towards where to learn about these, what the state of the art architectures look like, etc? Do they just convert the video into a stream of tokens, run it through a transformer, and output a stream of tokens?

vessenes22 hours ago | parent

I was reading their site, and I too have some questions about this architecture.

I'd be very interested to see what the output of their 'big model' is that feeds into the small model. I presume the small model gets a bunch of environmental input, and some input from the big model, and we know that the big model input only updates every 30 or 40 frames in terms of small model.

Like, do they just output random control tokens from big model and embed those in small model and do gradient descent to find a good control 'language'? Do they train the small model on english tokens and have the big model output those? Custom coordinates tokens? (probably). Lots of interesting possibilities here.

By the way, the dataset they describe was generated by a large (much larger presumably) vision model tasked with creating tasks from successful videos.

So the pipeline is:

* Video of robot doing something

* (o1 or some other high end model) "describe very precisely the task the robot was given"

* o1 output -> 7B model -> small model -> loss

loading story #43117060

causal1 day ago | parent | next

I'm always wondering at the safety measures on these things. How much force is in those motors?

This is basically safety-critical stuff but with LLMs. Hallucinating wrong answers in text is bad, hallucinating that your chest is a drawer to pull open is very bad.

loading story #43128272

silentwanderer1 day ago | parent | next

In terms of low-level safety, they can probably back out forces on the robot from current or torque measurement and detect collisions. The challenge comes with faster motions carrying lots of inertia and behavioral safety (e.g. don't pour oil on the stove)

mmh00001 day ago | parent | next

The thing in the video moves slower than the sloth in Zootopia. If you die by that robot, you probably deserve it.

throwaway0123_523 hours ago | root | parent | next

As a sibling comment implies though, there's also danger from it being stupid while unsupervised. For example, I'd be very nervous having it do something autonomously in my kitchen for fear of it burning down my house by accident.

mikehollinger23 hours ago | root | parent | next

From a different robot (Boston Dynamics' new Atlas) - the system moves at a "reasonable" speed. But watch at 1m20s in this video[1]. You can see it bump and then move VERY quickly -- with speed that would certainly damage something, or hurt someone.

[1] https://www.youtube.com/watch?v=F_7IPm7f1vI

loading story #43121986

dr_kiszonka19 hours ago | root | parent | next

They are designed to penetrate Holtzman shields, surely.

causal23 hours ago | root | parent | next

Are you saying it cannot move faster than they because of some kind of governor?

Symmetry23 hours ago | root | parent | next

A governor, the firmware in the motor controllers, something like that. Certainly not the neural network though.

UltraSane17 hours ago | root | parent

That is how I would design it. It is common in safety critical PLC systems to have 1 or more separate safety PLCs that try to prevent bad things from happening.

idiotsecant6 hours ago | root | parent

Although in a SIL safety system the dangerous events are identified and extremely thoroughly characterized as part of system design.

There cannot be a safety system of this type for a generalist platform like a humanoid robot. It's possibility space is just too high.

I think the safety governor in this case would have to be a neural network that is at least as complex as the robots network, if not more so.

Which begs the question: what system checks that one for safety?

exe3423 hours ago | root | parent

or if you're old, injured, groggy from medication, distracted by something/someone else, blind, deaf or any number of things.

it's easy to take your able body for granted, but reality comes to meet all of us eventually.

UltraSane17 hours ago | parent | next

You can have dedicated controllers for the motors that limit their max torque.

loading story #43125170

cess111 day ago | parent

Not a big deal on the battlefield.

loading story #43118425

loading story #43117993

Symmetry1 day ago | parent | next

So, there's no way you can have fully actuated control of every finger joint with just 35 degrees of freedom. Which is very reasonable! Humans can't individually control each of our finger joints either. But I'm curious how their hand setups work, which parts are actuated and which are compliant. In the videos I'm not seeing any in-hand manipulation other than just grasping, releasing, and maintaining the orientation of the object relative to the hand and I'm curious how much it can do / they plan to have it be able to do. Do they have any plans to try to mimic OpenAI's one handed rubics cube demo?

wwwtyro1 day ago | parent | next

Until we get robots with really good hands, something I'd love in the interim is a system that uses _me_ as the hands. When it's time to put groceries away, I don't want to have to think about how to organize everything. Just figure out which grocery items I have, what storage I have available, come up with an optimized organization solution, then tell me where to put things, one at a time. I'm cautiously optimistic this will be doable in the near term with a combination of AR and AI.

camjw1 day ago | parent | next

Maybe I don't understand exactly what you're describing but why would anyone pay for this? When I bring home the shopping I just... chuck stuff in the cupboards. I already know where it all goes. Maybe you can explain more?

loading story #43116687

__MatrixMan__1 day ago | root | parent | next

It would be nice to be able to select a recipe and have it populate your shopping list based on what is currently in your cupboards. If you just chuck stuff in the cupboards then you have to be home to know what they contain.

Or you could wear it while you cook and it could give you nutrition information for whatever it is you cooked. Armed with that it could make recommendations about what nutrients you're likely deficient in based on your recent meals and suggest recipes to remedy the gap--recipes based on what it knows is already in the cupboard.

loading story #43117701

mistercheph23 hours ago | root | parent

[flagged]

luma1 day ago | root | parent | next

> why would anyone pay for this?

Presumably, they won't as this is still a tech demo. One can take this simple demonstration and think about some future use cases that aren't too different. How far away is something that'll do the dishes, cook a meal, or fold the laundry, etc? That's a very different value prop, and one that might attract a few buyers.

Philip-J-Fry1 day ago | root | parent

The person you're replying to is referring to the GP. The GP asks for an AI that tells them where to put their shopping. Why would anyone pay for THAT? Since we already know where everything goes without needing an AI to tell us. An AI isn't going to speed that up.

loading story #43117521

loading story #43116314

loading story #43116478

loading story #43118482

loading story #43116384

RedNifre23 hours ago | parent | next

I fully agree, building something like this is somewhere in my back log.

I think the key point why this "reverse cyborg" idea is not as dystopian as, say, being a worker drone in a large warehouse where the AI does not let you go to the toilet is that the AI is under your own control, so you decide on the high level goal "sort the stuff away", the AI does the intermediate planning and you do the execution.

We already have systems like that, every time you use you tell your navi where you want to go, it plans the route and gives you primitive commands like "on the next intersection, turn right", so why not have those for cooking, doing the laundry, etc.?

Heck, even a paper calendar is already kinda this, as in separating the planning phase from the execution phase.

loading story #43120832

loading story #43120550

loading story #43117263

loading story #43116360

loading story #43117748

loading story #43116311

loading story #43119341

lucianbr1 day ago | parent | next

You want to outsource thinking to a computer system and keep manual labor? You do you, but I want the opposite. I want to decide what goes where but have a robot actually put the stuff there.

TeMPOraL1 day ago | root | parent | next

That's the problem, though - the computer is already better at thinking than you, but we still don't know how to make it good at arbitrary labor requiring a mix of precision and power, something humans find natural.

In other words: I'm sorry, but that's how reality turned out. Robots are better at thinking, humans better at laboring. Why fight against nature?

(Just joking... I think.)

RedNifre1 day ago | root | parent

I think he means outsourcing everything eventually, but right now, outsourcing the thought process is possible, while outsourcing the manual labor is not.

loading story #43118459

ziofill1 day ago | parent | next

There’s nothing I want more than a robot that does house chores. That’s the real 10x multiplier for humans to do what they do best.

loading story #43118413

loading story #43118071

dartos1 day ago | parent | next

Hopefully in the next decade we’ll get there.

Vision+language multimodal models seem to solve some of the hard problems.

loading story #43117186

siavosh1 day ago | parent | next

What do humans do best?

loading story #43116171

ein0p22 hours ago | root | parent | next

Browse Instagram, apparently.

loading story #43116918

loading story #43117674

loading story #43117914

plipt22 hours ago | parent | next

The demo is quite interesting but I am mostly intrigued by the claim that it is running totally local to each robot. It seems to use some agentic decision making but the article doesn't touch on that. What possible combo of model types are they stringing together? Or is this something novel?

The article mentions that the system in each robot uses two ai models.

    S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data

and the other

    S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level [motor?] control.

It feels like although the article is quite openly technical they are leaving out the secret sauce? So they use an open source VLM to identify the objects on the counter. And another model to generate the mechanical motions of the robot.

What part of this system understands 3 dimensional space of that kitchen?

How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?

How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?

    Figure robots, each equipped with dual low-power-consumption embedded GPUs

Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding?

loading story #43122687

bbor21 hours ago | parent

I'm very far from an expert, but:

  What part of this system understands 3 dimensional space of that kitchen?

The visual model "understands" it most readily, I'd say -- like a traditional Waymo CNN "understands" the 3D space of the road. I don't think they've explicitly given the models a pre-generated pointcloud of the space, if that's what you're asking. But maybe I'm misunderstanding?

  How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?

It appears that the robot is being fed plain english instructions, just like any VLM would -- instead of the very common `text+av => text` paradigm (classifiers, perception models, etc), or the less common `text+av => av` paradigm (segmenters, art generators, etc.), this is `text+av => movements`.

Feeding the robots the appropriate instructions at the appropriate time is a higher-level task than is covered by this demo, but I think is pretty clearly doable with existing AI techniques (/a loop).

  How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?

If your question is "where's the GPUs", their "AI" marketing page[1] pretty clearly implies that compute is offloaded, and that only images and instructions are meaningfully "on board" each robot. I could see this violating the understanding of "totally local" that you mentioned up top, but IMHO those claims are just clarifying that the individual figures aren't controlled as one robot -- even if they ultimately employ the same hardware. Each period (7Hz?) two sets of instructions are generated.

[1] https://www.figure.ai/ai

  What possible combo of model types are they stringing together? Or is this something novel?

Again, I don't work in robotics at all, but have spent quite a while cataloguing all the available foundational models, and I wouldn't describe anything here as "totally novel" on the model level. Certainly impressive, but not, like, a theoretical breakthrough. Would love for an expert to correct me if I'm wrong, tho!

EDIT: Oh and finally:

  Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding?

Surely they are downplaying the difficulties of getting this setup perfectly, and don't show us how many bad runs it took to get these flawless clips.

They are seeking to raise their valuation from ~$3B to ~$40B this month, sooooooo take that as you will ;)

https://www.reuters.com/technology/artificial-intelligence/r...

plipt21 hours ago | root | parent

    their "AI" marketing page[1] pretty clearly implies that compute is offloaded

I think that answers most of my questions.

I am also not in robotics, so this demo does seem quite impressive to me but I think they could have been more clear on exactly what technologies they are demonstrating. Overall still very cool.

Thanks for your reply

verytrivial23 hours ago | parent | next

Are they claiming these robots are also silent? They seem to have "crinkle" sounds handling packaging, which if added in post seems needlessly smoke-and-mirror for what was a very impressive demonstration (of robots impersonating an extreme stoned human.)

loading story #43116559

23 hours ago | parent | next

{"deleted":true,"id":43118544,"parent":43115079,"time":1740077500,"type":"comment"}

aerodog1 day ago | parent | next

Interesting timing - same day MSFT releases https://microsoft.github.io/Magma/

loading story #43117578

pr337h4m1 day ago | parent | next

Goal 2 has been achieved, at least as a proof of concept (and not by OpenAI): https://openai.com/index/openai-technical-goals/

Symmetry23 hours ago | parent

They can put away clutter but if they could chop a carrot or dust a vase they'd have shown videos demonstrating that sort of capability.

EDIT: Let alone chop an onion. Let me tell you having a robot manipulate onions is the worst. Dealing with loose onion skins is very hard.

loading story #43120649

loading story #43119377

sandis1 day ago | parent | next

YouTube link for the video (for whatever reason the video hosted on their site kept buffering for me): https://www.youtube.com/watch?v=Z3yQHYNXPws

loading story #43116627

sottol1 day ago | parent | next

Imo, the Terminator movies would have been scarier if they moved like these guys - slow, careful, deliberate and measured but unstoppable. There's something uncanny about this.

megous1 day ago | parent

Unfortunately, there'll be no time travel to save us. That was the lying part of the movie. Other stuff was true.

kla-s1 day ago | parent | next

Does anyone know how long they have been at this? Is this mainly a reimplementation of the physical intelligence paper + the dual size/freq + the cooperative part?

pr337h4m1 day ago | parent

"Over a year" according to the founder: https://x.com/adcock_brett/status/1892578309344502191

loading story #43125061

andiareso23 hours ago | parent | next

Seriously, what's with all of these perceived "high-end" tech companies not doing static content worth a damn.

Stop hosting your videos as MP4s on your web-server. Either publish to a CDN or use a platform like YouTube. Your bandwidth cannot handle serving high resolution MP4s.

/rant

loading story #43125999

loading story #43123144

bhouston1 day ago | parent | next

When doing robot control, how do you model in the control of the robot? Do you have tool_use / function calling at the top level model which then gets turned into motion control parameters via inverse kinematic controllers?

What is the interface from the top level to the motors?

I feel it can not just be a neural network all the way down, right?

Philpax1 day ago | parent | next

Have a look at the post - it explains how it works. There are two models: a 7-9Hz 7B vision-language model, and a 200Hz 80M visuomotor model. The former produces a latent vector, which is then interpreted by the latter to drive the motors.

NitpickLawyer1 day ago | root | parent

> a 7-9Hz 7B vision-language model, and a 200Hz 80M visuomotor model.

huh. An interesting approach. I wonder if something like this can be used for other things as well, like "computer use" with the same concept of a "large" model handling the goals, and a "small" model handling clicking and stuff, at much higher rates, useful for games and things like that.

loading story #43116717

loading story #43125295

traverseda1 day ago | parent | next

"The first time you've seen these objects" is a weird thing to say. One presumes that this is already in their training set, and that these models aren't storing a huge amount of data in their context, so what does that even mean?

jayd161 day ago | parent | next

It probably gives them confidence that they can accurately see a thing even though they don't know what that thing is.

I could also imagine a lot of safety around leaving things outside of the current task alone so you might have to bend over backwards to get new objects worked on.

thomastjeffery22 hours ago | root | parent

There is no such thing as "thing" here.

These models are trained such that the given conditions (the visual input and the text prompt) will be continued with a desirable continuation (motor function over time).

The only dimension accuracy can apply to is desirability.

loading story #43122466

ygouzerh1 day ago | parent | next

So from what I understand it actually means that they were for example never trained on a video of an apple. Maybe only on a video of bread, pineapple, chocolate.

However, as it was trained using generic text data similarly to a normal LLM, it knows how an apple is supposed to look like.

Similar than a kid that never saw a banana, but his parent described it to him.

Symmetry1 day ago | parent

It's normal to have a training set and a validation set and I interpreted that to mean that these items weren't in the training set.

loading story #43117662

loading story #43121681

loading story #43116205

loading story #43116586

ianamo23 hours ago | parent | next

Are we at a point now where Asimov’s laws are programmed into these fellas somewhere?

thomastjeffery22 hours ago | parent

Nope.

The article clearly spells out that it's end to end LLM. Text and video in, motor function out.

Technically, the text model probably has a few copies, but they are nothing more than Asimov's narrative. Laws don't (and can't) exist in a model

loading story #43119428

loading story #43121006

loading story #43116099

loading story #43122507

loading story #43118874

bilsbie1 day ago | parent | next

They should have made them talk. It’s a little dehumanizing otherwise.

anentropic1 day ago | parent | next

Very impressive

Why make such sinister-looking robots though...?

jayd161 day ago | parent | next

With the way they move, they look like stoned teenagers interning at a Bond villain factory. Not to knock the tech but they're scary and silly at the same time.

esafak1 day ago | parent

Black was not the best color choice.

anentropic7 hours ago | root | parent

Well, when you put it like that I feel a bit uncomfortable...

But it did seem like title of their mood board must have been "Black Mirror".

Very uncanny valley, the glossy facelessness. It somehow looks neither purely utilitarian/industrial nor 'friendly'. I could see it being based on the aesthetic of laptops and phones, i.e. consumer tech, but the effect is so different when transposed onto a very humanoid form.

loading story #43120248

kubb1 day ago | parent | next

Wow! This is something new.

loading story #43119607

loading story #43119001

loading story #43117571

loading story #43117131

#visit	12082758
#session	46787
#live-session	0