I'd like to disagree with the claim that it's irrelevant. If anything, the 100B models are the irrelevant part in this context and should be seen as a "fun inclusion" rather than a serious point of comparison. Outperforming a 100B model at the time makes for a nice bragging point, but it's not the core value of the method or the paper.

Running a prompt against every single cell of a 10k row document was never gonna happen with a large model. Even using a transformer architecture in the first place could be seen as ludicrous overkill for this task, though it is at least feasible on modern machines.
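Just as a back-of-the-envelope illustration (every number below is an assumption picked for easy arithmetic, not a measured figure or a real price):

    # Rough cost of prompting a large hosted model once per cell of a 10k-row sheet.
    # All figures are assumptions, purely for illustration.
    rows, cols = 10_000, 20            # assumed sheet size
    tokens_per_call = 300              # assumed prompt + completion tokens per cell
    price_per_1k_tokens = 0.02         # assumed $/1k tokens for a large hosted model
    calls = rows * cols
    total_tokens = calls * tokens_per_call
    print(f"{calls:,} calls, {total_tokens:,} tokens, "
          f"~${total_tokens / 1000 * price_per_1k_tokens:,.0f}")
    # 200,000 calls, 60,000,000 tokens, ~$1,200 at the assumed price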

So I'd say the paper is very relevant, and the top commenter in this very thread demonstrated their own homegrown version with a very nice use case (sorting paper titles and abstracts to put together a summary paper).

> Running a prompt against every single cell of a 10k row document was never gonna happen with a large model

That isn’t the main point of FLAME, as I understood it. The main point was to help you while you’re editing a particular cell. codex-davinci was used for real-time Copilot tab completions for a long time, I believe, and editing a single formula in a spreadsheet is far less demanding than editing code in a large document.

After I posted my original comment, I realized I should have pointed out that I’m fairly sure we have 8B models that handily outperform codex-davinci these days, further driving home how irrelevant the claim of “>100B” was here (not talking about the paper). Plus, an off-the-shelf model like Qwen2.5-0.5B (a 494M model) could probably be fine-tuned to compete with (or dominate) FLAME if you had access to the FLAME training data; there is probably no need to train a model from scratch, and a 0.5B model can easily run on any computer that can run the current version of Excel.
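To sketch what I mean (nothing here is from the paper: the dataset file, field names, and hyperparameters are placeholders standing in for the non-public FLAME training data, and this is just the stock Hugging Face fine-tuning recipe):

    # Minimal causal-LM fine-tuning sketch for Qwen2.5-0.5B on formula-completion pairs.
    # "formula_pairs.jsonl" and its fields are hypothetical stand-ins for FLAME-style data.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model_name = "Qwen/Qwen2.5-0.5B"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    ds = load_dataset("json", data_files="formula_pairs.jsonl")["train"]

    def tokenize(ex):
        # Concatenate spreadsheet context and target formula into one training string.
        text = ex["context"] + "\n" + ex["formula"] + tokenizer.eos_token
        return tokenizer(text, truncation=True, max_length=256)

    ds = ds.map(tokenize, remove_columns=ds.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="qwen-formula",
                               per_device_train_batch_size=8,
                               num_train_epochs=3,
                               learning_rate=2e-5),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()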

You may disagree, but my point was that claiming a 60M model outperforms a 100B model just means something entirely different today. Putting that in the original comment higher in the thread creates confusion, not clarity, since the models in question are very bad compared to what exists now. No one had clarified that the paper was over a year old until I commented… and FLAME was being tested against models that seemed to be over a year old even when the paper was published. I don’t understand why the researchers were testing against such old models even back then.