I've seen this reply to Simon's benchmark for 2 years running now, and yet you still see improvements and objectively-bad results over time from new releases, even when I'm sure every frontier AI team has/had a person at least partially dedicated to better bicycle-pelican SVG outputs. Alas.
I had intended to caveat that: I'm sure I'm not the first person to ask about this!
> you still see improvements
This is expected if they are training their models on it, right?
> objectively-bad results
Keen to learn when this has been the case, i.e. across version increments in major models.
I've written about this a couple of times, most notably here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
I've been enjoying seeing how the quality of individual models differs based on the amount of reasoning effort you give them. If they were baking in a good pelican you wouldn't expect them to differ so much.
(Google Gemini are the only lab that have very clearly paid attention to the quality of SVG animals-riding-vehicles, see their announcement for Gemini 3.1: https://twitter.com/JeffDean/status/2024525132266688757 )
I honestly assumed their comment was tongue-in-cheek humour, because positively no one actually cares how these models generate an SVG pelican riding a bicycle. It's just a meme that always shows up here.