Story Detail of id 47362071 | Liveview Hacker News

bonoboTP 12 hours ago | on: Executing programs inside transformers with exponentially faster inference

This shows the downside of using AI to write up your project. I see the eloquent sentences, but don't get the message.

> This works, but the actual execution happened outside the model. The model specified the computation, then waited for an external system to carry it out.

> Our transformer also emits a program, but instead of pausing for an external tool, it executes that program itself, step by step, within the same transformer.

What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?

Why is it good that it's "inside" the model? Just making it more elegant and nice? The tool was already "inside" the overall hybrid system. What's the actual problem?

famouswaffles 11 hours ago | parent | next

> This shows the downside of using AI to write up your project. I see the eloquent sentences, but don't get the message.

Not really sure what this obsession with calling things you don't like "AI-generated" is, but it's poor form. If you have something to say about the text, then say it. Otherwise, leave baseless accusations out of it.

> What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so?

It's pretty clearly an ideological thing. Some people are firmly in the 'some sort of symbolic logic is necessary' camp. From the article: 'A system that cannot compute cannot truly internalize what computation is.'

Some things are just interesting for the sake of it. This is one of those things. I don't agree with the authors on the above and I'm still glad they shared. It's a very interesting read regardless.

radarsat1 10 hours ago | parent | next

> Is it speed?

> Is it that you can backprop through this computation? Do you do so?

With respect, I feel that you may not have read the article.

> Because the execution trace is part of the forward pass, the whole process remains differentiable: we can even propagate gradients through the computation itself. That makes this fundamentally different from an external tool. It becomes a trainable computational substrate that can be integrated directly into a larger model.

and,

> By storing points across nested convex hulls, this yields a decoding cost of O(k + log n).

and,

> Regardless of their eventual capability ceiling, they already suggest a powerful systems primitive for speeding up larger models.

So yes, and yes.

> Where are the benchmarks?

It's not clear what they should benchmark it against. They do compare speed to a normal KV cache. As for performance: if it's actually executing a Sudoku solver with a 100% success rate, it seems pretty trivial to find a model with a < 100% success rate to compare against. Sure, it would be nice to see the data here; I agree with you there.

Personally, I think it would be really interesting to see if this method can be combined with a normal model, MoE-style. It is likely possible: the router module should pick up quite quickly that it predicts the right tokens for some subset of problems deterministically. I like the idea of embedding all sorts of general solvers directly into the model, like a Prolog solver for example. In fact, it never would have occurred to me to just go straight for WASM; it's a pretty interesting choice to directly embed a VM. But it makes me wonder what "smaller" interpreters could be useful in this context.

bsenftner 9 hours ago | parent | next

Well, for one, by eliminating external tool calling, the model gains a measure of security. The tools an LLM calls can be corrupted, and in this scenario corrupted tools would never be called.

maytc 10 hours ago | parent | next

The key difference is that the model is able to write the program as it’s executing it.

Before, it needed to write the code and have an external program execute it. Here it can change its mind mid-execution. Kinda like the "aha moment" that was observed in CoT.

armchairhacker 10 hours ago | parent | next

What are the AI tells? The only one I found is redundancy, but it makes sense because this is trying to be approachable to laymen.

Like, you have a great point (the benefit of this approach isn't explained), but that's a mistake humans frequently make.

andy12_ 11 hours ago | parent | next

Honestly, the most interesting thing here is definitely that just 2D heads are enough to do useful computation (at least they are enough to simulate an interpreter) and that there is an O(log n) algorithm to compute argmax attention with 2D heads. It seems that you could make an efficient pseudosymbolic LLM with some frozen layers that perform certain deterministic operations, but also other layers that are learned.
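
For anyone wondering why 2D heads make argmax attention cheap, here is a minimal Python sketch. It is not the article's code, and every name in it (`convex_hull`, `argmax_brute`, `argmax_hull`) is hypothetical. It only shows the geometric fact the speedup rests on: with 2D keys, the key that maximizes the dot product with any query is always a vertex of the convex hull of the keys. This sketch still scans the hull vertices linearly; getting to O(log n) would presumably need a binary search over the ordered hull, which is left out here.

```python
import random

def cross(o, a, b):
    # z-component of (a - o) x (b - o); positive means a counter-clockwise turn
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    # Andrew's monotone chain; returns hull vertices in counter-clockwise order
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def dot(q, k):
    return q[0] * k[0] + q[1] * k[1]

def argmax_brute(q, keys):
    # hard (argmax) attention over all keys: O(n) per query
    return max(keys, key=lambda k: dot(q, k))

def argmax_hull(q, hull):
    # same result, but only hull vertices need to be considered
    return max(hull, key=lambda k: dot(q, k))

# Illustrative data: integer 2D keys so the comparison is exact
random.seed(0)
keys = [(random.randint(-50, 50), random.randint(-50, 50)) for _ in range(500)]
hull = convex_hull(keys)
```

The hull is built once per set of keys, and each query then touches only the hull vertices rather than all 500 keys. Ties can make the two functions return different keys with the same score, so compare scores rather than the keys themselves.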
