For me personally, the extra-long documentation and abstractions are exactly what I DON'T want in a benchmarking repo. E.g.: what transformers version is this, will it support TGI v3, will it automatically strip thinking traces via a flag in the code or the run command, will it run the latest models that need a custom transformers version, etc.
And if it's not a locally runnable product, it should at least have a publicly accessible leaderboard where you can submit OSS models, or something.
Just my opinion. I don't like it. It looks like way too much documentation and code slop for what should just be a 3-line command.
We actually designed it to work easily with any API. You just create a wrapper around your API and you're good to go. We take care of the async/concurrent handling of the benchmarking, so evaluation speed is really only limited by the rate limit of your LLM API.
This link shows what a wrapper looks like: https://docs.confident-ai.com/guides/guides-using-custom-llm...
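To make that concrete, here is a rough sketch of what such a wrapper might look like, assuming deepeval's DeepEvalBaseLLM base class and an OpenAI-compatible endpoint. The class name, client, and endpoint below are illustrative, not taken from this thread; see the linked guide for the actual interface.

    # Rough sketch: a wrapper around an OpenAI-compatible API endpoint.
    # Class/method names follow deepeval's custom-model interface as
    # described in the linked guide; treat the details as illustrative.
    from openai import OpenAI, AsyncOpenAI
    from deepeval.models import DeepEvalBaseLLM

    class MyAPIModel(DeepEvalBaseLLM):
        def __init__(self, base_url: str, api_key: str, model_name: str):
            self.model_name = model_name
            self.client = OpenAI(base_url=base_url, api_key=api_key)
            self.async_client = AsyncOpenAI(base_url=base_url, api_key=api_key)

        def load_model(self):
            # Nothing to load for a remote API; return the model identifier.
            return self.model_name

        def generate(self, prompt: str) -> str:
            resp = self.client.chat.completions.create(
                model=self.model_name,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content

        async def a_generate(self, prompt: str) -> str:
            # Async variant used for the concurrent benchmarking mentioned above.
            resp = await self.async_client.chat.completions.create(
                model=self.model_name,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content

        def get_model_name(self) -> str:
            return self.model_name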
And once your model wrapper is set up, you can use any benchmark we provide.
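Usage then looks roughly like this (MMLU is just one assumed example of a provided benchmark; check the docs for the exact API):

    # Hypothetical usage: point one of the provided benchmarks at the
    # wrapper sketched above (MMLU used purely as an example).
    from deepeval.benchmarks import MMLU

    model = MyAPIModel(
        base_url="https://your-llm-endpoint/v1",
        api_key="...",
        model_name="your-model",
    )
    benchmark = MMLU()
    benchmark.evaluate(model=model)
    print(benchmark.overall_score)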