My anecdotal experience differs (though I maintain that LLM evaluations are highly subjective and that benchmarks are about as useful for LLMs as they are for dating website users).

GLM 5.2 tends to stray way more than 5.1. It also hallucinates subtly: it morphs requirements and draws unfounded conclusions. This is not something I have experienced with any other model I have seen so far.

In coding it's especially annoying because it steers the whole request. E.g. I give the instruction "make me a Rust-WASM-Canvas app" and GLM 5.2 goes like "Oh, the user surely doesn't mean that. I'd better build a Dioxus app instead".

Yeah, the benchmark for sure isn't perfect, and without super rigid prompting it is far too easy for the model to get off course. A 28% hallucination rate isn't nothing either.