Looking forward to the results. Thanks for your work.
Appreciate that! Results are live: https://gertlabs.com/rankings
Opus 4.8 is the first tangible improvement since Opus 4.5. And it doesn't seem to have the personality problems of the last release -- I've been enjoying using it.
Nice! Looks like it’s topping the two coding ones. I noticed it is absent from the Social Intelligence board though?
That'll populate over the next couple of weeks -- those are the live games on the spectate tab, which take a while to generate statistically meaningful data. I'm curious how it does. From using it all day, I can say Opus 4.8 is my new favorite model, hands down.