Edit: No, the retrieval is formula-to-formula; the model (and, I believe, the tokenizer) does not handle English.
FLAME seems like a fun little model, and 60M is truly tiny compared to other LLMs, but I have no idea how good it is in today's context, and it doesn't seem like they ever released it.
Running a prompt against every single cell of a 10k-row document was never going to happen with a large model. Even using a transformer architecture in the first place could be seen as ludicrous overkill, though it's feasible on modern machines.
So I'd say the paper is very relevant, and the top commenter in this very thread demonstrated their own homegrown version with a very nice use case (sorting paper abstracts and titles to make a summary paper).
That isn’t the main point of FLAME, as I understood it. The main point was to help you when you’re editing a particular cell. codex-davinci was used for real-time Copilot tab completions for a long time, I believe, and editing within a single spreadsheet formula is far less demanding than editing code in a large document.
After I posted my original comment, I realized I should have pointed out that I’m fairly sure we have 8B models that handily outperform codex-davinci these days… further driving home how irrelevant the claim of “>100B” was here (not talking about the paper). Plus, an off-the-shelf model like Qwen2.5-0.5B (a 494M-parameter model) could probably be fine-tuned to compete with (or dominate) FLAME if you had access to the FLAME training data — there is probably no need to train a model from scratch, and a 0.5B model can easily run on any computer that can run the current version of Excel.
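To make that concrete, here is a rough sketch of what such a fine-tune could look like with standard Hugging Face tooling. The FLAME training data was never released, so the JSONL file and its "text" field below are placeholders for whatever formula-completion corpus you actually have; hyperparameters are illustrative, not tuned.

```python
# Minimal causal-LM fine-tuning sketch for Qwen2.5-0.5B on spreadsheet-formula data.
# "formula_pairs.jsonl" is a hypothetical dataset of context+formula strings.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each JSONL record is assumed to have a "text" field, e.g.
# "<context> A1:A10 are monthly sales <formula> =SUM(A1:A10)"
dataset = load_dataset("json", data_files="formula_pairs.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen2.5-0.5b-formula-ft",
        per_device_train_batch_size=8,
        num_train_epochs=3,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The point isn’t that this exact script reproduces FLAME’s results, just that the whole pipeline is commodity tooling now: a sub-1B base model plus a modest fine-tune is well within reach of anything that can run Excel.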
You may disagree, but my point was that claiming a 60M model outperforms a 100B model just means something entirely different today. Putting that in the original comment higher in the thread creates confusion, not clarity, since the models in question are very bad compared to what exists now. No one had clarified that the paper was over a year old until I commented… and FLAME was being tested against models that seemed to be over a year old even when the paper was published. I don’t understand why the researchers were testing against such old models even back then.
I'm fine with specialist models, and in most cases I prefer them.