Short version: if a model can answer a very high proportion of a benchmark's questions accurately, the next step is to ask it two or more questions at a time. On some models the quality of the answers varies dramatically with which question is asked first.
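A minimal sketch of what that pairing probe could look like, assuming you already have a harness. `ask`, the `questions` list, and the `graded` checker are placeholders for whatever you're using, not anything from the linked docs:

```python
from itertools import combinations

def ask(prompt: str) -> str:
    """Stand-in for your model call (API client, local inference, etc.)."""
    raise NotImplementedError

def order_sensitivity(questions: list[str], graded) -> float:
    """Ask every pair of questions in both orders; return the fraction of
    pairs whose correctness pattern changes when the order is swapped.
    `graded(question, response)` is assumed to return True/False."""
    pairs = list(combinations(questions, 2))
    flips = 0
    for a, b in pairs:
        ab = ask(f"Answer both questions.\n1. {a}\n2. {b}")
        ba = ask(f"Answer both questions.\n1. {b}\n2. {a}")
        # A "flip": correctness on either question depends on the ordering.
        if (graded(a, ab), graded(b, ab)) != (graded(a, ba), graded(b, ba)):
            flips += 1
    return flips / len(pairs)
```

On a saturated benchmark this fraction should be near zero; anything well above that is the order-dependence described above.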
>Docs for the synthesizer are here: https://docs.confident-ai.com/docs/synthesizer-introduction; there's a nice "how does it work" section at the bottom explaining it more.
Very good start, but the statistics of the generated text matter a lot.
As an example: on a dumb-as-bricks benchmark I designed, I can saturate the reasoning capabilities of all non-reasoning models just by varying the names of the objects in the questions. A model that got a normalized score of 14 with standard object strings could score as high as 18 with one-letter strings standing in for objects, and as low as zero with arbitrary UTF-8 strings. That last case turned out to matter a lot, since all the data was polluted with international text coming from the stock exchanges.
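For concreteness, here's a hedged sketch of that kind of name perturbation. The template, the name pools, and the use of the Cyrillic block as a stand-in for "arbitrary UTF-8 strings" are all illustrative assumptions, not the actual benchmark:

```python
import random
import string

def utf8_name(length: int = 6) -> str:
    """Arbitrary non-ASCII name; the Cyrillic block is just one example range."""
    return "".join(chr(random.randint(0x0400, 0x04FF)) for _ in range(length))

STANDARD = ["widget", "gadget", "sprocket"]   # illustrative pool
ONE_LETTER = list(string.ascii_uppercase)

def variants(template: str, n_objects: int):
    """Fill a template's {0}, {1}, ... slots under each naming scheme.
    Pools must hold at least n_objects names."""
    for scheme, pool in [("standard", STANDARD),
                         ("one-letter", ONE_LETTER),
                         ("utf-8", [utf8_name() for _ in range(n_objects)])]:
        names = random.sample(pool, n_objects)
        yield scheme, template.format(*names)

# e.g.:
# for scheme, q in variants("If {0} weighs twice as much as {1}, which is heavier?", 2):
#     print(scheme, "->", q)
```

Running the same scored question set through each scheme is what produces the 18 / 14 / 0 spread mentioned above.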
Feel free to drop me a line if you're interested in a more in-depth conversation. LLMs are _ridiculously_ under-tested for how many places they show up in.