GLM-5.1: Towards Long-Horizon Tasks

537zixuanlimit | 19 hours ago | 221 | HN

We're still adding samples, but some early takeaways from benchmarking on https://gertlabs.com:

Contrary to the model card, its one-shot performance is more impressive than its agentic abilities. On both metrics, GLM 5.1 is competitive with frontier models.

But keeping in mind this is an open source model operating near the frontier, it's nothing short of incredible.

I suspect 2 issues with the model are keeping it from fully realizing its potential in agentic harnesses: - Context rot (already a common complaint). We are still working on a metric to robustly test and visualize this on the site. - The model was most likely overtrained on standardized toolsets and benchmarks, and isn't as adaptive in using arbitrary tooling in our custom harness simulations. We've decided to commit to measuring intelligence as the ability to use custom, changing tools, instead of being trained to use specific tools (while still always providing a way to run local bash and other common tools). There are arguments to be made for either, but the former is more indicative of general intelligence. Regardless, it's a subtle difference and GLM 5.1 still performs well with tooling in our environments.

Crazy week for open source AI. Gemma 4 has shown that large model density is nowhere near optimized. Moats are shrinking.

If there are more representations of model performance you'd like to see, I'm actively reading your feedback and ideas.

loading story #47685358

loading story #47686329

loading story #47685264

Ms-J6 hours ago | parent | next

Z.ai and their GLM models are pretty low quality.

I've been testing it for awhile now since it seemed to have potential as a local model.

With this new update it still cannot parse simple, test PDFs correctly. It inconsistently tells me that the value in the name field in the document is incorrect, and has the name reversed to put the last name first. Or that a date is wrong as it's in the past/future, when it is not. Tons of fundamental errors like that.

Even when looking at the thinking process there are issues:

I used a test website for it to analyze and it says that the sites copyright year states 2026 which is in the future and to investigate as it could be an attack, but right after prints today's correct date.

I'm in the process of trying to get it uncensored. Hopefully that will create some use out of z.ai

Edit: by the way, which is the best uncensored model at the moment?

loading story #47685813

loading story #47687210

loading story #47686959

loading story #47686954

loading story #47686519

loading story #47685512

Yukonv18 hours ago | parent | next

Unsloth quantizations are available on release as well. [0] The IQ4_XS is a massive 361 GB with the 754B parameters. This is definitely a model your average local LLM enthusiast is not going to be able to run even with high end hardware.

[0] https://huggingface.co/unsloth/GLM-5.1-GGUF

zozbot23417 hours ago | parent

SSD offload is always a possibility with good software support. Of course you might easily object that the model would not be "running" then, more like crawling. Still you'd be able to execute it locally and get it to respond after some time.

Meanwhile we're even seeing emerging 'engram' and 'inner-layer embedding parameters' techniques where the possibility of SSD offload is planned for in advance when developing the architecture.

loading story #47680506

simonw14 hours ago | parent | next

Not only did this one draw me an excellent pelican... it also animated it! https://simonwillison.net/2026/Apr/7/glm-51/

stingraycharles9 hours ago | parent | next

Surely at this point it’s part of the training set and the benchmark has lost its value?

ipsum214 hours ago | parent | next

It made it realistic. A pelican is much more likely to be flying in the sky than riding a bicycle.

_pdp_13 hours ago | parent

Simon, you need to come up with improved benchmarks soon.

loading story #47682316

loading story #47683050

dvt12 hours ago | parent | next

Every single day, three things are becoming more and more clear:

    (1) OpenAI & Anthropic are absolutely cooked; it's obvious they have no moat
    (2) Local/private inference is the future of AI
    (3) There's *still* no killer product yet (so get to work!)

bottlepalm9 hours ago | parent | next

This has got to be bait..

1) OpenAI and Anthropic are killing it, and continue to do so, their coding tools are unmatched for professionals.

2) Local models don't hold a candle to SOTA models and there's nothing on the horizon that indicates that consumers will be able to run anything close to what you can get in a data center.

3) Coding is a killer product, OpenAI and Anthropic are raking in the cash. The top 3 apps are apps in the app store are AI. Everyone who knows anything is using AI, every day, across the economy.

svcrunch8 hours ago | root | parent | next

The grandparent is definitely wrong on (3). Yes, coding is a killer product, I agree with you.

On (2), I agree with you for local models. BUT, there are also the open source Chinese models accessible via open-router. Your argument ("don't hold a candle to SOTA models") does not hold if the comparison is between those.

On (1), I agree more with the grandparent than with your assessment. Yes, OpenAI and Anthropic are killing it for now, but the time horizon is very short. I use codex and claude daily, but it's also clear to me that open source is catching up quickly, both w.r.t. the models and the agentic harnesses.

loading story #47685393

loading story #47684735

dvt8 hours ago | root | parent

I don't want to respond to 100 comments about the same thing, and this one happens to be on top, so, in my humble opinion:

(1): You don't have to be an Ed Zitron disciple to infer that OpenAI and Anthropic are likely overvalued and that Nvidia is selling everyone shovels in a gold rush. AI is a game-changing technology, but a shitty chat interface does not a company make. OpenAI and Anthropic need to recoup astronomical costs used in training these models. Models that are now being distilled[1] and are quickly becoming commoditized. (And frankly, models that were trained by torrenting copyrighted data[2], anyway.) Many have been calling this out for years: the model cannot be your product. And to be clear, OpenAI/Anthropic most definitely know this: that's why they've been aquihiring like crazy, trying to find that one team that will make the thing.

(2): Token prices are significantly subsidized and anyone that does any serious work with AI can tell you this. Go use an almost-SOTA model (a big Deepseek or Qwen model) offered by many bare-metal providers and you'll see what "true" token prices should look like. The end-state here is likely some models running locally and some running in the cloud. But the current state of OpenClaw token-vomit on top of Claude is fiscally untenable (in fact, this is why Anthropic shut it down).

(3): This is typical Dropbox HN snark[3], of which I am also often guilty of. I really don't think AI coding is a killer product and this seems very myopic—engineers are an extreme minority. Imo, the closest we've seen to something revolutionary is OpenClaw, but it's janky, hard to set up, full of vulnerabilities, and you need to buy a separate computer. But there's certainly a spark there. (And that's personally the vertical I'm focusing on.)

[1] https://www.anthropic.com/news/detecting-and-preventing-dist...

[2] https://media.npr.org/assets/artslife/arts/2025/complaint.pd...

[3] https://news.ycombinator.com/item?id=9224

loading story #47686784

jimmaswell11 hours ago | parent | next

No killer product? Coding assistants and LLM's in general are the single most awe-inspiring achievement of humanity in my lifetime, technological or otherwise. They've already massively improved my and others' lives and they're only going to get better. If pre and post industrial revolution used to be the major binary delineation of our history, I'm fairly confident it will soon be seen as pre and post AI instead.

zozbot23410 hours ago | root | parent | next

Coding assistants are currently quite hard to run locally with anything like SOTA abilities. Support in the most popular local inference frameworks is still extremely half-baked (e.g. no seamless offload for larger-than-RAM models; no support for tensor-parallel inference across multiple GPUs, or multiple interconnected machines) and until that improves reliably it's hard to propose spending money on uber-expensive hardware one might be unable to use effectively.

loading story #47684024

loading story #47684335

pdntspa11 hours ago | root | parent | next

I know right? 8-year-old me dreamed of being able to articulate software to a computer without having to write code. It (along with the original Stable Diffusion) are Definitely one of the coolest inventions to ever come along in my lifetime

Covenant00285 hours ago | root | parent | next

> Coding assistants and LLM's in general are the single most awe-inspiring achievement of humanity in my lifetime

Landing a man on the moon is way more impressive. Finding several vaccines for a once in a century pandemic within a year of its outbreak is and achievement that in its impact and importance dwarfs what the entire LLM industry put together has achieved. The near-complete eradication of polio, once again, way more important and impactful.

loading story #47687947

bitexploder11 hours ago | root | parent | next

No killer products... just robots that can do vulnerability analysis at the level of a decent security engineer and write code without tiring.

allan_s11 hours ago | root | parent | next

I've also been using the LLM in Posthog and it has been impressive. I need to check if I can also plug a MCP/Skill to my actual claude code so that I can cross reference the data from my other data source (stripe, local database, access logs etc.) for in depth analysis

loading story #47683500

rimliu3 hours ago | root | parent

yeah, painting yourself into a corner at 10x speed is hardly the most awe-inspiring achievement of humanity.

Glaklloo3 hours ago | parent | next

Google doesn't release Gemma 4 if Gemini is similiar good.

We probably talk abuot a year of progress diffeerence.

Its also still quite expensive for an avg person to consume any of it. Either due to hardware invest, energy cost or API cost.

Also professionally I don't think anyone will really spend a little bit less money of having the 3th quality model running if they can run the best model.

I'm happy that we reach levels were this becomes an alternative if you value open and control though.

eldenring11 hours ago | parent | next

I don't see how its possible to think this. AI coding assistants are some of the most useful technologies ever created, and model quality is by far the most important thing, so I doesn't make sense why local inference would be the path forward unless something fundamentally changes about hardware.

sunir9 hours ago | root | parent

The hardware will change. We know that.

grafmax10 hours ago | parent | next

> no moat

I'd like to think the superior product wins. But Windows still thrives despite widespread Linux availability. I think sometimes we can underestimate the resilience of the tech oligopolies, particularly when they're VC-funded.

jjfoooo410 hours ago | root | parent | next

VC can spend all the money in the world and it won't matter if the cost of switching providers is effectively zero.

If I want to switch from Windows to Linux, I have to reconsider a whole variety of applications, learn a different UX, migrate data, all sorts of annoyances.

When I switch between Codex and Claude Code, there is literally no difference in how I interact with them. They and a number of other competitors are drop in replacements for each other.

AlienRobot8 hours ago | root | parent

>I'd like to think the superior product wins. But Windows still thrives despite widespread Linux availability.

That's because by most metrics Linux is inferior is Windows.

kcb11 hours ago | parent | next

What benefit is there to dropping $50k on GPUs to run this personally besides being a cool enthusiast project?

marcus_holmes10 hours ago | root | parent | next

Why would anyone need more than 640Kb of memory?

loading story #47684045

deminature11 hours ago | root | parent | next

Intel has just released a high VRAM card which allows you to have 128GB of VRAM for $4k. The prices are dropping rapidly. The local models aren't adapted to work on this setup yet, so performance is disappointing. But highly capable local models are becoming increasingly realistic. https://www.youtube.com/watch?v=RcIWhm16ouQ

loading story #47684110

blizdiddy11 hours ago | root | parent | next

Is it so hard to project out a couple product cycles? Computers get better. We’ve gone from $50k workstation to commodity hardware before several times

loading story #47684008

CamperBob210 hours ago | root | parent | next

It will run exactly the same tomorrow, and the next day, and the day after that, and 10 years from now. It will be just as smart as the day you downloaded the weights. It won't stop working, exhaust your token quota, or get any worse.

That's a valuable guarantee. So valuable, in fact, that you won't get it from Anthropic, OpenAI, or Google at any price.

loading story #47684695

fwipsy10 hours ago | root | parent

Agree directionally but you don't need $50k. $5k is plenty, $2-3k arguably the sweet spot.

loading story #47683845

loading story #47683995

hodgehog1110 hours ago | parent | next

(1) is absolutely not true if you actually use these models on a regular basis and include Google in here too. The difference in reliability beyond basic tasks is night and day. Their reward function is just so much better, and there are many nuanced reasons for this.

(2) is probably true but with caveats. Top-tier models will never run on desktop machines, but companies should (and do) host their own models. The future is open-weight though, that much is for sure.

(3) This is so ignorant that others have already responded to it. Look outside of your own bubble, please.

neonstatic10 hours ago | root | parent

> Top-tier models will never run on desktop machines

Sorry, but you don't know that

loading story #47684013

AlienRobot9 hours ago | parent | next

I was trying to use Claude.ai today to learn how to do hexagonal geometry.

Every time I asked a question it generated an interactive geometry graph on the fly in Javascript. Sometimes it spent minutes compiling and testing code on the server so it could make sure it was correct. I was really impressed.

Anyway I couldn't really learn anything since when the code didn't work I wasn't sure if I had ported it wrong or the AI did it wrong, so I ended up learning how to calculate SDF and pixel to hex grid from tutorials I found on google instead.

jurschreuder5 hours ago | root | parent

This is also my exact experience

mgfist11 hours ago | parent | next

Posted this after mythos came out? The hutzpah

fwipsy10 hours ago | parent | next

No moat: yes. Cooked: no. It's a race. Why assume they're going to lose? It relies on (2) which is only true if AI usefulness plateaus at some level of compute. That's a huge claim to be making at this stage. (3) AI has lots of killer products already. The big one is filling in moats. Unrealized potential though for sure.

DeathArrow7 hours ago | parent | next

>(1) OpenAI & Anthropic are absolutely cooked; it's obvious they have no moat

I think big corporations will continue to use them no matter how cheap and good other models are. There's a saying: nobody was fired for buying IBM.

IncreasePosts9 hours ago | parent | next

How good would open source models be if they couldn't distill higher quality private models?

anon29111 hours ago | parent | next

(3) is simply a lie spread by engineers who have no other context. I manage some real estate (mid-term rentals) and everyone I know has switched over to AI robo-handlers to do the contact at this point. It's almost a passive investment at this point. Some can even handle interfacing with contractors and service requests for you. Revolutionized the field in my opinion.

neonstatic10 hours ago | parent

The model is the killer product

alex7o18 hours ago | parent | next

To be honest I am a bit sad as, glm5.1 is producing mich better typescript than opus or codex imo, but no matter what it does sometimes go into shizo mode at some point over longer contexts. Not always tho I have had multiple session go over 200k and be fine.

InsideOutSanta15 hours ago | parent | next

I just set the context window to 100k and manage it actively (e.g. I compact it regularly or make it write out documentation of its current state and start a new session).

For me, Opus 4.6 isn't working quite right currently, and I often use GLM 5.1 instead. I'd prefer to use peak Opus over GLM 5.1, but GLM 5.1 is an adequate fallback. It's incredible how good open-weight models have gotten.

disiplus17 hours ago | parent | next

When it works and its not slow it can impress. Like yesterday it solved something that kimi k2.5 could not. and kimi was best open source model for me. But it still slow sometimes. I have z.ai and kimi subscription when i run out of tokens for claude (max) and codex(plus).

i have a feeling its nearing opus 4.5 level if they could fix it getting crazy after like 100k tokens.

loading story #47685711

MegagramEnjoyer17 hours ago | parent | next

Why is that sad? A free and open source model outperforming their closed source counterparts is always a win for the users

loading story #47679355

DeathArrow17 hours ago | parent | next

After the context gets to 100k tokens you should open a new session or run /compact.

cmrdporcupine17 hours ago | parent | next

I honestly still hold onto habits from earlier days of Claude & Codex usage and tend to wipe / compact my context frequently. I don't trust the era of big giant contexts, frankly, even on the frontier models.

loading story #47679918

csomar7 hours ago | parent | next

I've set max context to 180k and usually compact around 120k. It's much better to re-read stuff than to have it under-perform when it's over 120k.

varispeed16 hours ago | parent

Isn't the same with opus nowadays?

johnfn17 hours ago | parent | next

GLM-5.0 is the real deal as far as open source models go. In our internal benchmarks it consistently outperforms other open source models, and was on par with things like GPT-5.2. Note that we don't use it for coding - we use it for more fuzzy tasks.

sourcecodeplz17 hours ago | parent | next

Yep, haven't tried 5.1 but for my PHP coding, GLM-5 is 99% the same as Sonnet/Opus/GPT-5 levels. It is unbelievably strong for what it costs, not to mention you can run it locally.

deepsquirrelnet17 hours ago | parent | next

I am working on a large scale dataset for producing agent traces for Python <> cython conversion with tooling, and it is second only to gemini pro 3.1 in acceptance rates (16% vs 26%).

Mid-sized models like gpt-oss minimax and qwen3.5 122b are around 6%, and gemma4 31b around 7% (but much slower).

I haven’t tried Opus or ChatGPT due to high costs on openrouter for this application.

foopod9 hours ago | parent | next

It really bothers me that people refer to open weight models as being open source. They fundamentally aren't and are more akin to freeware than anything else.

epolanski15 hours ago | parent

Same thing I noticed.

My use cases are not code editing or authoring related, but when it comes to understanding a codebase and it's docs to help stakeholders write tasks or understand systems it has always outperformed american models at roughly half the price.

clark10133 hours ago | parent | next

I’ve been using GLM 5.1 instead of GPT 5.4 for a few days now, and it’s working smoothly.

winterqt17 hours ago | parent | next

Comments here seem to be talking like they've used this model for longer than a few hours -- is this true, or are y'all just sharing your initial thoughts?

KaoruAoiShiho17 hours ago | parent | next

Blog post is new but the model is about 2 weeks in public.

stavros17 hours ago | parent | next

My local tennis court's reservation website was broken and I couldn't cancel a reservation, and I asked GLM-5.1 if it can figure out the API. Five minutes later, I check and it had found a /cancel.php URL that accepted an ID but the ID wasn't exposed anywhere, so it found and was exploiting a blind SQL injection vulnerability to find my reservation ID.

Overeager, but I was really really impressed.

loading story #47680000

loading story #47682944

loading story #47679853

loading story #47679815

loading story #47680479

BeetleB17 hours ago | parent

It's been out for a while.

loading story #47687482

minimaxir17 hours ago | parent | next

The focus on the speed of the agent generated code as a measure of model quality is unusual and interesting. I've been focusing on intentionally benchmaxxing agentic projects (e.g. "create benchmarks, get a baseline, then make the benchmarks 1.4x faster or better without cheating the benchmarks or causing any regression in output quality") and Opus 4.6 does it very well: in Rust, it can find enough low-level optimizations to make already-fast Rust code up to 6x faster while still passing all tests.

It's a fun way to quantify the real-world performance between models that's more practical and actionable.

kamranjon16 hours ago | parent | next

I'm crossing my fingers they release a flash version of this. GLM 4.7 Flash is the main model I use locally for agentic coding work, it's pretty incredible. Didn't find anything in the release about it - but hoping it's on the horizon.

XCSme11 hours ago | parent | next

GLM 5.1 does worse than GLM 5 in my tests[0] (both medium reasoning OR no reasoning).

I think the model is now tuned more towards agentic use/coding than general intelligence.

[0]: https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...

XCSme11 hours ago | parent

The (none) version especially shows considerable degradation.

RickHull18 hours ago | parent | next

I am on their "Coding Lite" plan, which I got a lot of use out of for a few months, but it has been seriously gimped now. Obvious quantization issues, going in circles, flipping from X to !X, injecting chinese characters. It is useless now for any serious coding work.

unicornfinder18 hours ago | parent | next

I'm on their pro plan and I respectfully disagree - it's genuinely excellent with GLM 5.1 so long as you remember to /compact once it hits around 100k tokens. At that point it's pretty much broken and entirely unusable, but if you keep context under about 100k it's genuinely on par with Opus for me, and in some ways it's arguably better.

loading story #47678899

loading story #47679028

kay_o18 hours ago | parent | next

I am on the mid tier Coding plan to trying it out for the sake of curiosity.

During off peak hour a simple 3 line CSS change took over 50 minutes and it routinely times out mid-tool and leaves dangling XML and tool uses everywhere, overwriting files badly or patching duplicate lines into files

loading story #47680441

InsideOutSanta15 hours ago | parent | next

My impression is that different users get vastly different service, possibly based on location. I live in Western Europe, and it works perfectly for me. Never had a single timeout or noticeable quality degradation. My brother lives in East Asia, and it's unusable for him. Some days, it just literally does not work, no API calls are successful. Other days, it's slow or seems dumber than it should be.

loading story #47684756

loading story #47685258

satvikpendem18 hours ago | parent | next

Every model seems that way, going back to even GPT 3 and 4, the company comes out with a very impressive model that then regresses over a few months as the company tries to rein in inference costs through quantization and other methods.

wolttam18 hours ago | parent | next

This is surprising to me. Maybe because I'm on Pro, and not Lite. I signed up last week and managed to get a ton of good work done with 5.1. I think I did run into the odd quantization quirk, but overall: $30 well spent

Mashimo18 hours ago | parent | next

I'm also on the lite plan and have been using 5.1 for a few days now. It works fine for me.

But it's all casual side projects.

Edit: I often to /compact at around 100 000 token or switch to a new session. Maybe that is why.

LaurensBER17 hours ago | parent | next

I'm on their lite plan as well and I've been using it for my OpenClaw. It had some issues but it also one-shotted a very impressive dashboard for my Twitter bookmarks.

For the price this is a pretty damn impressive model.

cmrdporcupine17 hours ago | parent | next

Is there any advantage to their fixed payment plans at all vs just using this model via 3rd party providers via openrouter, given how relatively cheap they tend to be on a per-token basis?

Providers like DeepInfra are already giving access to 5.1 https://deepinfra.com/zai-org/GLM-5.1

$1.40 in $4.40 out $0.26 cached

/ 1M tokens

That's more expensive than other models, but not terrible, and will go down over time, and is far far cheaper than Opus or Sonnet or GPT.

I haven't had any bad luck with DeepInfra in particular with quantization or rate limiting. But I've only heard bad things about people who used z.ai directly.

loading story #47682694

loading story #47685242

benterix17 hours ago | parent | next

> Obvious quantization issues

Devil's advocate: why shouldn't they do it if OpenAI, Anthropic and Google get away with playing this game?

loading story #47680816

esafak17 hours ago | parent | next

I'm on their Lite plan and I see some of this too. It is also slow. I use it as a backup.

margorczynski18 hours ago | parent

It has been useless for long time when compared to Opus or even something like Kimi. The saving grace was that it was dirt cheap but that doesn't matter if it can't do what I want even after many repeated tries and trying to push it to a correct solution.

mark_l_watson15 hours ago | parent | next

I can’t wait to try it. I set up a new system this morning with OpenClaw and GLM-5, and I like GLM-5 as the backend for Claude Code. Excellent results.

8dazo11 hours ago | parent | next

Just saw the Claude Mythos post. Not sure when it’s going public, but this feels like a real jump, not just incremental progress. Also waiting for the next GLM release coz specs are looking kind of insane.

zozbot23410 hours ago | parent

Gemini and GPT have Deep Research models already, Mythos looks like much the same thing.

dryarzeg13 hours ago | parent | next

A bit off-topic, but for some reason, even though I don't use LLMs for my job or for my hobbies, or in daily life frequently (and when I do, it's mostly some kind of "rubber duck brainstorm"), when I see open-weight releases like this one or the recent Gemma 4 (which is very good for local models); the first time was with DeepSeek-R1 (this one, despite being blamed for "censorship", was heavily censored only via DeepSeek API, the local model - full-weight 685B, not the distilled ones - was pretty much unhinged regarding censorship on any topic)... there's always one song coming to mind and I simply can't get rid of it no matter how hard I try.

"I am the storm that is approaching, provoking..." : )

kirby8817 hours ago | parent | next

I wonder how that compare to harness methods like MAKER https://www.cognizant.com/us/en/ai-lab/blog/maker

blazespin13 hours ago | parent | next

Anthropic's reply? A model you can't use.

minimaxir13 hours ago | parent

Mythos is most definitely not in response to this announcement.

gavinray17 hours ago | parent | next

I find the "8 hour Linux Desktop" bit disingenuous, in the fine print it's a browser page:

  > "build a Linux-style desktop environment as a web application"

They claim "50 applications from scratch", but "Browser" and a bunch of the other apps are likely all <iframe> elements.

We all know that building a spec-compliant browser alone is a herculean task.

MrPowerGamerBR16 hours ago | parent | next

In my opinion it would be way cooler if it actually created a real Linux desktop environment instead of only a replica.

Would it succeed? Probably not, but it would be way more interesting, even if it didn't work.

I find things like Claude's C compiler way more interesting where, even though CCC is objectively bad (code is messy, generates very bad unoptimized code, etc) it at least is something cool and shows that with some human guideance it could generate something even better.

bredren17 hours ago | parent

It is a big claim without the source and prompting.

tgtweak17 hours ago | parent | next

Share the harness for that browser linux OS task :)

DeathArrow17 hours ago | parent | next

I am already subscribed to their GLM Coding Pro monthly plan and working with GLM 5.1 coupled with Open Code is such a pleasure! I will cancel my Cursor subscription.

epolanski15 hours ago | parent | next

I was very satisfied with GLM5, I'm not gonna lie.

Excited to test this.

EITB_20266 hours ago | parent | next

Good One Though

philipwhiuk11 hours ago | parent | next

This is the flip side of the Project Glasswing stuff...

Everyone else isn't that far behind and they aren't all gonna just wall off their new model.

A reason that Anthropic will eventually give is 'the competition can do what Glasswing can do so what's the point limiting it'.

bigyabai18 hours ago | parent | next

It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts. When you crest 128k tokens, there's a high chance that the model will start spouting gibberish until you compact the history.

For short-term bugfixing and tweaks though, it does about what I'd expect from Sonnet for a pretty low price.

cassianoleal17 hours ago | parent | next

I've done some very long sessions on OpenCode with Dynamic Context Pruning. Highly recommend it.

https://github.com/Opencode-DCP/opencode-dynamic-context-pru...

embedding-shape18 hours ago | parent | next

> It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts

Since the entire purpose, focus and motivation of this model seems to have been "coherency over longer contexts", doesn't that issue makes it not an OK model? It's bad at the thing it's supposed to be good at, no?

loading story #47678404

whimblepop18 hours ago | parent | next

That's pretty few, at least for the way I'm currently using LLMs. I have them do some Nix work (both debugging and coding) where accuracy and quality matters to me, so they're instructed to behave as I would when it comes to docs, always consulting certain docs and source code in a specific order. It's not unusual for them to chew through 200k - 600k tokens in a single session before they solve everything I want them to. That's what I currently think of when I think of "long horizon within a single context window".

So I need them to not only not devolve into gibberish, but remain smart enough to be useful at contexts several times longer than that.

nkko17 hours ago | parent | next

Yes, this is frustrating, but it doesn’t occur in CC. I run the conversation logs through an agent and opencode source, and it identified an issue in the reasoning implementation of opencode for Zai models. Consequently, I ceased my research and opted to use CC instead.

jauntywundrkind18 hours ago | parent | next

Chiming in to second this issue. It is wildly frustrating.

I suspect that this isn't the model, but something that z.ai is doing with hosting it. At launch I was related to find glm-5.1 was stable even as the context window filled all the way up (~200k). Where-as glm-5, while it could still talk and think, but had forgotten the finer points of tool use to the point where it was making grevious errors as it went (burning gobs of tokens to fix duplicate code problems).

However, real brutal changes happened sometimes in the last two or three months: the parent problem emerged and emerged hard, out of nowhere. Worse, for me, it seemed to be around 60k context windows, which was heinous: I was honestly a bit despondent that my z.ai subscription had become so effectively useless. That I could only work on small problems.

Thankfully the coherency barrier raised signficiantly around three weeks go. It now seems to lose its mind and emits chaotic non-sentance gibberish around 100k for me. GLM-5 was already getting pretty shaky at this point, so I feel like I at least have some kind of parity. But at least glm-5 was speaking & thinking with real sentances, I could keep conversing with it somewhat, where-as glm-5.1 seems to go from perfectly level headed working fine to all of a sudden just total breakdown, hard switch, at such a predictable context window size.

It seems so so probable to me that this isn't the model that's making this happen: it's the hosting. There's some KV cache issue, or they are trying to expand the context window in some way, or to switch from one serving pool of small context to a big context serving pool, or something infrastructure wise that falls flat and collapses. Seeing the window so clearly change from 200k to 60k to 100k is both hope, but also, misery.

I've been leaving some breadcrumbs on Bluesky as I go. It's been brutal to see. Especially having tasted a working glm-5.1. I don't super want to pay API rates to someone else, but I fully expect this situation to not reproduce on other hosting, and may well spend the money to try and see. https://bsky.app/profile/jauntywk.bsky.social/post/3mhxep7ek...

All such a shame because aside from totally going mad & speaking unpuncutaed gibberish, glm-5.1 is clearly very very good and I trust it enormously.

loading story #47681077

loading story #47681493

loading story #47681478

loading story #47679046

HumanOstrich17 hours ago | parent | next

I wonder if running the compaction in a degraded state produces a subpar summary to continue with.

loading story #47681224

azuanrb18 hours ago | parent

Have you compared it with using Claude Code as the harness? It performs much better than opencode for me.

jaggs17 hours ago | parent | next

How does it compare to Kimi 2.5 or Qwen 3.6 Plus?

XCSme11 hours ago | parent | next

General intelligence (not coding) comparison: https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...

loading story #47685177

eis17 hours ago | parent | next

The blog post has a benchmark comparison table with these two in it

loading story #47680268

DeathArrow17 hours ago | parent

Compared to Kimi 2.5 or Qwen 3.6 Plus I don't know, but I ran GLM 5 (not 5.1) side by side with Qwen 3.5 Plus and it was visibly better.

loading story #47687970

dang18 hours ago | parent | next

[stub for offtopicness]

[[you guys, please don't post like this to HN - it will just irritate the community and get you flamed]]

smith701818 hours ago | parent | next

Hmm, three spam comments posted within 9 minutes of each other. The accounts were created 15 minutes ago, 51 days ago, and 3 months ago.

Interesting.

Hopefully these aren't bots created by Z.AI because GLM doesn't need fake engagement.

loading story #47678268

loading story #47678422

loading story #47678479

zendi19 hours ago | parent | next

[flagged]

louszbd19 hours ago | parent | next

[flagged]

seven292819 hours ago | parent

[flagged]

meidad_g11 hours ago | parent | next

[dead]

aryehof6 hours ago | parent | next

[dead]

aplomb102618 hours ago | parent | next

[dead]

EddyAI16 hours ago | parent | next

[dead]

andrewmcwatters18 hours ago | parent | next

[dead]

maxdo16 hours ago | parent

One of the bench maxed models . Every time I tried it , it’s not on par even with other open source models .

wallmountedtv13 hours ago | parent

Feeling very much the same. Attempting to use it through Claude Code as a model it just completely lost all context on what it was doing after a few months and kept short circuiting even with the most helpful prompts I could give, outside of just writing out the answer myself. I really do not get the praise for this model.

Being "better than Opus 4.6" is not really something a benchmark will tell you. It's much more a consensus of users liking the flavor of an answer, rather than fueling x% correct on a benchmark.

#visit	13,255,179
#session	74,665
#live-session	0