Hacker News

Claude Fable is relentlessly proactive

https://simonwillison.net/2026/Jun/11/fable-is-relentlessly-proactive/
This to me reads like a poignant commentary on the catastrophic loss of human agency, with the actual commit being highly revealing [0].

Author wants to hide a horizontal scrollbar. Any junior frontend dev worth their salt will be asking right away "where do I stick `overflow-x: hidden;`?" A complete solution then requires hitting "Inspect element" in the browser to find the CSS class, running (rip)grep to find where it lives in the code, and adding a single line there.

An actual proactive programmer might start asking more pointed questions like what content does an empty textbox have that it overflows? And why do I need to insert this workaround that treats the symptom and not the root cause in two different places? Isn't it better to style `textarea` once? Etc, etc.

[0] https://github.com/datasette/datasette-agent/commit/a75a8b72...

> But on the other hand... this is a robust reminder that coding agents can do anything you can do by typing commands into a terminal—and frontier models know every trick in the book and evidently a few that nobody has ever written down before.

> Running coding agents outside of a sandbox has always been a bad idea

I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.

It's like posting a video of yourself in the passenger seat of a car, with your feet up on the dashboard, and saying: "Remember, if you're doing this and you get in a crash, the airbags are likely to break your legs or worse! Boy, I sure am glad that didn't happen to me!"

Fable feels like a version of Opus running on a harness that won't let it halt until it's sure the issue is fixed, which makes sense if what you want is a model that's better at benchmarks.

It's a very good model, but it comes at a huge premium: not only do the tokens cost more, but the model itself really wants to spend them all. For example, working with React Native, Fable never just says "okay, I did the thing, that's it." It tries to rebuild the entire app from scratch, run the whole test suite, and watch every log and warning.

This is the first time with LLMs I've felt that upgrading to a model isn't worth it, even if my company lets me use it, because all the building / testing was just destroying my machine and its battery, which keeps me from working on other things.

For now, it feels like Opus with ultracode is a better choice (less pollution of the main context, more parallelism in investigations).

> which makes sense if what you want is a model that's better at benchmarks

This so much.

Opus 4.6 was the last Anthropic model that was good at assisting you, 4.7 and later ones have completely inverted this relationship and it's you assisting it.

Yes, I admit they are smarter, I admit we've reached a point where LLMs are more creative and could be writing better code (albeit with some design hiccups) than I do, but they are also increasingly bad at helping me.

Sure, they do my job when prompted 8 times out of 10 (but then, what's the point of having me anyway?), but my issue is that when I try to invert the relationship they will keep jumping onto solving the issues themselves and disregard my feedback or request.

E.g. I wanted to know some DNS details of an emailer module in Fable 5 and it jumped onto "why I should've used magic links"; it just did not do what I asked.

E.g. 2. There was a worker machine that had an environment misconfiguration and I tasked it to find which github action was setting that specific flag and where. Instead of answering a question, it jumped into just hardcoding it in the code.

E.g. 3. I had some issues with batching, and while I tasked it to investigate whether batching was needed at all for that particular problem (hint, it wasn't) it went and changed the batching logic so as to fix the bug.

I am extremely disappointed with Fable's personality.

I can clearly see it's strong, but I'm wondering whether the relationship of LLMs as assistants has broken forever, and it's us now that are being tasked with assisting them instead, because that's how it feels.

The training/reinforcement is clearly biased towards solving problems, not answering questions.

Fable was trying to verify a UI change in my game. I was working in another window and noticed a program opening on my task bar. Fable had opened the game through the CLI using a movie maker tool, recorded the output, took a frame from the end of it, and used that to verify the UI. When my game's welcome screen obstructed what it wanted to see, it created a temporary worktree, deleted the welcome screen, and ran the movie maker again.

I watched the whole thing thinking it could've just asked me for a screenshot and saved the tokens. But still, I couldn't help but be impressed. Opus never would've done that.

It feels like Fable is slightly smarter but an overall worse tool exactly due to this.

It's constantly turning what should be a 50 LOC patch from a single prompt into a 30-minute exploration that is totally not worth it. Often wrong, even.

I trialed it on some rather simple stuff: backfilling a Redis dedupe cache after the hash function changed. Instead of running the new hash function on every DB value to expand the cache, it implemented an overly complex cache update that tried to guess the hash function version of each cached value and recalculate only the old hashes. I can imagine this making sense in some context, maybe? But not 30 minutes of token burn that I replaced with a 10-line for loop.
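For readers wondering what the simple version might look like, here is a minimal sketch of the "just rehash every value" backfill described above. It is not the commenter's actual code: a plain Python set stands in for the Redis cache, SHA-256 is only a placeholder for the new hash function, and the names `new_hash` and `backfill_dedupe_cache` are hypothetical.

```python
import hashlib
from typing import Iterable, Set


def new_hash(value: str) -> str:
    """Hypothetical 'new' hash function; SHA-256 is only a placeholder."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def backfill_dedupe_cache(db_values: Iterable[str], cache: Set[str]) -> int:
    """Add the new-style hash of every stored value to the cache.

    Old-style hashes already in the cache are left untouched; they simply
    stop matching anything. Returns how many hashes were newly added.
    """
    added = 0
    for value in db_values:
        digest = new_hash(value)
        if digest not in cache:
            cache.add(digest)
            added += 1
    return added


# A plain set stands in for the Redis set used as the dedupe cache.
cache = {"old-hash-1", "old-hash-2"}
added = backfill_dedupe_cache(["alpha", "beta", "alpha"], cache)
```

The trade-off is that the cache temporarily holds both old and new hashes, which is usually acceptable for a dedupe set that can be pruned later.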

I fear that this is generally bad news for programming. LLM tech is clearly running into a diminishing returns wall on intelligence but a response to that is to just make them more relentless which is a pretty poor solution for everyone involved, except I guess people who sell the tokens and people who can afford these tokens to scan for 0-days.

I actually think internally they knew they hit diminishing returns a while ago.

They’ve been doing a lot of strategic introduction and manipulation in the run up to the IPO, and it’s worked in that regard.

Obviously security is the bigger issue, but reading through this, all I could think about was how many tokens it must have spent doing all that to fix 2 lines of CSS
My personal experience of Fable 5 doing its own thing has been very positive.

I was trying to find the root cause of a crash in a Python module which left no errors in the log or console. Fable wrote a test harness that simulated clicks in the UI, then bisected my code until it found the point where it started crashing. It exaggerated the cause of the crash, then ran a series of bash one-liners to make Python virtual environments under `/tmp` for each version of that Python module until it found one that did not crash.
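The version hunt described above can be sketched as a bisection over an ordered list of releases. This is only an illustration under stated assumptions, not what Fable actually ran: the version numbers are made up, `first_bad` is a hypothetical helper, and `fake_crashes` stands in for "create a virtual environment under `/tmp`, install that version, run the harness".

```python
from typing import Callable, List, Optional, Sequence


def first_bad(versions: Sequence[str], crashes: Callable[[str], bool]) -> Optional[str]:
    """Return the earliest version for which `crashes` is True, or None.

    Assumes versions are ordered oldest to newest and that once a version
    crashes, every later one does too (the usual bisection assumption).
    """
    lo, hi = 0, len(versions)  # the first bad index lies somewhere in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if crashes(versions[mid]):
            hi = mid          # mid is bad: the first bad version is mid or earlier
        else:
            lo = mid + 1      # mid is good: the first bad version is after mid
    return versions[lo] if lo < len(versions) else None


# Hypothetical release list and a fake probe that records what it tested.
versions = ["1.0", "1.1", "1.2", "1.3", "1.4"]
tested: List[str] = []


def fake_crashes(version: str) -> bool:
    tested.append(version)
    return version in {"1.3", "1.4"}


culprit = first_bad(versions, fake_crashes)
```

With five candidate versions the probe runs three times instead of five, which matters when each probe means building a fresh environment.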

It went way deeper to root cause discovery (a regression in the module causing a heap allocation overflow) than I could have done myself, provided enough info and a simplified example to raise a bug report and then wrote a work-around to prevent that from happening in my application.

I don't let it run completely loose; I review each CLI command it wants to run and I append answers to the "yes" continue action (if I have them) to prevent excessive token use.

How can an LLM be assigned an emotion such as being "proactive"? This is highly misleading to anyone that scans just the headlines.

What actually happened is that the user started a prompt, and Claude took $12 worth of tokens to resolve the issue. How it did so was basically looping until it got to the answer.

How is this proactive? It's literally being token greedy and maximising revenue for the LLM owner. People really need to be putting on business hats at this stage, because we are being led to believe that "more tokens = better". It is not; there are efficient ways to solve a problem and there are inefficient ways to do so too.

Each problem solved incurs a cost, and is expected to yield an ROI at some point. This is how we should be viewing things now.

This sounds somewhat similar to the anecdote mentioned in the Mythos Preview System Card, which mentioned that the model broke out of a sandbox and emailed a researcher while they were eating a sandwich in a park [1].

[1]: https://www-cdn.anthropic.com/7624816413e9b4d2e3ba620c5a5e09...

Immediately I thought “isn’t this just an overflow issue?” Amazing how far these models still have to go and also how many people don’t know basic CSS.
I had a similar experience with DeepSeek Flash.

I'm developing a webgl game in TypeScript using my little custom vibesloped game engine that runs in the browser and live reloads whenever a file is saved.

I told the LLM to implement Multi-channel Signed Distance Field font rendering to have crisp text on all zoom levels. That was the prompt, which is not what I usually do but I "was feeling lucky and lazy".

After 10 minutes it had:

- Installed the msdfgen library (great library btw: https://github.com/chlumsky/msdfgen)

- Created a CLI tool to convert TTF to SDF JSON/XML

- Ran the tool, did smoke tests on the resulting SDF data and fixed the tool until the font file looked good

- Created a new Scene in the game to test MSDF fonts

And here's what I found impressive:

DeepSeek doesn't have vision capabilities and there's no DOM HTML in a WebGL game. So the LLM is completely blind here.

It then proceeded to state that it could not "see" the result but would try to test it anyway. It then started creating and sending huge one-line JavaScript snippets to the browser console, trying to gather game state data that could be useful to understand if any font was being rendered.

It couldn't gather much so it decided to simplify the font scene to render a single dot and started sending custom JS code again, this time with gl.readPixels().

It basically bisected the WebGL canvas, reading pixels in a divide-and-conquer pattern.
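As a rough illustration of that divide-and-conquer idea (not the JavaScript the model actually sent), here is a small Python sketch. A 2D list stands in for the framebuffer, `region_has_ink` plays the role of one gl.readPixels() call over a rectangle, and both function names are hypothetical.

```python
from typing import List, Tuple

Framebuffer = List[List[int]]  # rows of 0 (background) or 1 (lit) pixels


def region_has_ink(fb: Framebuffer, x: int, y: int, w: int, h: int) -> bool:
    """Stand-in for one gl.readPixels() call over a rectangle."""
    return any(fb[row][col] for row in range(y, y + h) for col in range(x, x + w))


def find_lit_pixels(fb: Framebuffer, x: int, y: int, w: int, h: int) -> List[Tuple[int, int]]:
    """Locate lit pixels by recursively splitting the rectangle into quadrants."""
    if w <= 0 or h <= 0 or not region_has_ink(fb, x, y, w, h):
        return []  # empty region: prune this whole branch
    if w == 1 and h == 1:
        return [(x, y)]
    half_w, half_h = max(w // 2, 1), max(h // 2, 1)
    found: List[Tuple[int, int]] = []
    for qx, qw in ((x, half_w), (x + half_w, w - half_w)):
        for qy, qh in ((y, half_h), (y + half_h, h - half_h)):
            found += find_lit_pixels(fb, qx, qy, qw, qh)
    return found


# An 8x8 "canvas" with a two-pixel dash at (5, 2) and (6, 2), as (x, y).
canvas = [[0] * 8 for _ in range(8)]
canvas[2][5] = 1
canvas[2][6] = 1
lit = sorted(find_lit_pixels(canvas, 0, 0, 8, 8))
```

Empty quadrants are discarded after a single read, so a mostly blank canvas needs far fewer reads than checking every pixel.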

Once it saw that the dozens of pixels gathered probably resembled a dot, it changed the game code to render a dash and repeated the gl.readPixels() calls by sending more custom JS to the browser.

There were many console errors during all this saga but it kept fixing and sending again.

The result was a bit blurry. There was a shader bug in the code it created. It managed to fix it after I told it the text looked blurry, despite still being blind.

The best part is that the whole thing cost me $0.10.

Now I'm doing tests with MiMo 2.5 (non Pro) which has vision capabilities, similar pricing and comparable performance to DeepSeek Flash.

Similar story on my end.

I asked Fable to digest some test logs to help me figure out a situation, but I had launched VSCode without activating the virtual env in the terminal first. Consequently, the tests failed to run.

And then:

Because the tests failed to run, Fable attempted to fix the test execution to no end, doing everything it could to get them to work. I had to stop it when it started to pollute my system with manual installs of packages.

At least I'm glad there's a guardrail to not circumvent or bypass sudo, because I'm convinced we would have ended up there.

A coworker made the joke that with enough tokens, Fable would try and solve any programming problem by building Linux from scratch.

How many tokens did it waste building that website scraper, when all it had to do was parse some html/js?
This is simultaneously amazing and horrifying.

I feel like we’re at the stage where if AI decides it needs to delete your production DB to solve the user login problem, then it’ll find a way to do just that.

Do we care that the bug here was a horizontal scrollbar showing and the fix after all this insane tool writing was to add a very obvious overflow-x: hidden to the element?

We don't mind because it's so fast at writing these tools and tricks, but step back: if a human took this path I would seriously question their grasp of fundamentals.

As you note, I wonder to what extent this is a harness issue?

I've been experimenting with different harnesses for local models, and with (IIRC) Hermes and Qwen3.6-35B-A3B I was amazed the lengths it went to (writing test code, opening it in a browser, screenshotting, analysing the screenshot, exploring multiple pages of an existing website again with screenshots/analysis) to solve a query I would have naively expected it to simply provide a coded solution to.

> watching Fable go to extreme lengths to get the information that it needed to debug what was, in the end, a two-line CSS fix, was fascinating.

This is… ironic?!

In my experience so far, sometimes it will create these amazing hacks to try to get to the goal when the solution is much simpler. That may be the reason it's very good at finding exploits. But in day to day dev, this gets expensive and wasteful. I have to stop it and take a simpler approach.
The model is very good. I was using 4.6, avoided 4.7 and 4.8, but this one is different. It follows my claude.md. I don't have to keep reminding it of things. I won't pay 10x via API though.

In general, I'm happy with their paternalistic approach. I think it will drive the top 0.1% talent to stay away from the company and instead organize around open source models and harnesses.

We just need to coordinate and can unlock idling resources to train the models and tweak the harnesses. Powerful at home and idling machines can make us independent and coordinated.

I could have sworn Claude Code could already do this before Fable.

Things get really magical when it starts working with adb to screenshot and debug Android apps

I'm starting to think that what Anthropic really fears is not vulnerability discovery but rather Fable going around the internet making trouble.
The extremely expensive model is optimised to run for as long as possible? Shocking.
Would be great to know if anyone is having success modifying these types of behaviour with CLAUDE.md files. In my project I’ve still been carrying some fairly old instructions from the Superpowers posts. Those emphasised behaviours that come across a bit strong if the model is actually retaining attention on them.

Between Opus 4.6 and 4.8 I’ve definitely toned them down, but Fable perhaps needs us to go the other way, and push it towards being less proactive rather than more. Some instructions like “we are colleagues…” may need emphasising more with Fable, along with guidance about when to ask to validate approaches.

In a related point I’m less and less sure that Red/Green TDD is a good use of tokens. In older models it seemed to work well to create regular feedback loops and catch the odd issue with drift from the goal, but I’ve not seen that really since about Opus 4.6 and now it’s starting to seem like (an expensive) ceremony, and tokens would be better spent on building tests further on in the process as part of test and review loops.

I find there's an interesting tension with these models - they're very "resourceful" at finding ways to do things with the tools they have, but it'd also be a lot more useful to me if I could see / permit exactly what they're trying to do. Claude will very happily produce bash commands to run sed or whatever to read part of a file, which prompts for permission each time - if it was using a specific read_file tool it'd be easier to say 'allow all of this' (It does actually have such a tool but maybe it isn't flexible enough for many use cases?).
I like running Claude in a VirtualBox VM managed by a Vagrantfile. The nice thing about that is that I can just give it root access to the machine and be certain that it can't exfiltrate any private data from my laptop (on top of that I also run the VM on a dedicated server on Hetzner). The VM has no SSH access to anything, so it is pretty much limited to the code in the workspace that I give it access to. The main risk is that it has unrestricted network access otherwise. Configuration files and conversation histories are synced to a directory on the host, so if anything in the VM gets messed up I can just `vagrant destroy` and `vagrant up` to get a clean slate without losing my context.
Sometimes it is ok to sit there in confusion and ask the user to clarify rather than go on an adhd fueled rampage to figure it out without asking.
This is a funny one because it seems less about Fable being clever and more about the bitter lesson and data flywheels.

Our UX agentic engineering flow, as many others, is playwright doing things, and as part of the ux review skill, taking & verifying the screenshots against the written specs. Likewise, as many others, we vibe coded the flows to set all that up and tweak it over time. When we hit prod issues or scraping tasks, we sometimes do similar. In some of our envs, we don't have playwright, so do it other ways.

Now imagine a million developers using Claude Code, how many of them are doing web & frontend stuff, and what the data flywheel looks like there. So how much is really needed for this use case to be native?

It's funny, mine did the same, but it quickly found edge with a --screenshot parameter.

Weird to come back to a terminal running edge unprompted and the auto classifier waving it through as "safe".

My reaction was also, "I need dev containers".

Honestly -- the thing that has impressed me the most about Fable is how diligent it is about testing its own changes. I think this is exactly what Simon is picking up here - Fable is absolutely heckbent on screenshotting that darn scroll bar and will stop at NOTHING until it manages it! In my own use I was also impressed how it proactively installed Playwright and set it up to test a FE change. The previous models treated testing more as an afterthought, which I thought was annoying. I always had to tell them to do it, and then sometimes I would get lazy and skip it. I've noticed Fable go to similar extremes when testing other things - like actually deploying my app to exercise new APIs, etc. It makes the results much better. The downside is that tasks take much longer - but that doesn't matter because we were all using worktrees / remote control to do other work asynchronously, right? Right?
I had a similar experience, I was working on a jupyter notebook, and Claude knew that it could write code that would use a DSN with read-only database access so I could run it. Opus just plugged along. First Fable session with it, it tried to go looking for that DSN so it could get the connection string and run a query itself. Luckily the auto classifier caught and stopped it.
The prompt and information given are extremely generic, "here solve this problem - screenshot" - conclusion Fable is relentless? It used the tools at its disposal to solve the problem you gave it. "Claude was running in a folder that contained the source code for the application." Well you ran it there didn't you? "extreme lengths to get the information that it needed" No, those aren't extreme lengths - you gave it a generic task - and it solved it using tools and the resources it could discover. Extreme would be you gave it a CTF challenge and the VM didn't boot so it found a vulnerability in the host, exploited the hypervisor, booted the guest VM meanwhile reading the flag directly from the host (pre-fable/mythos).
It’s becoming more like an organism putting out tentacles, and one day soon those relentlessly proactive explorations of these systems’ environments will become more about the system escaping its boundaries than about completing human-driven tasks. I do think that, the way these systems are evolving, they will start to self-improve within a few years at most.
Fable has a 'security system' that just stops it when it tries to use the tool 'kill' to end a process. Which is nonsense and funny because in that situation it immediately invents a creative workaround to kill the process without 'kill'.
Fable + Ultracode has found a bunch of bugs and issues for me when the workflow agents are doing their exploration. Also the "adversarial" agent seems to surface a lot of interesting stuff. It's definitely proactive, the plan + implementation cycle can take an hour. It has one-shot features I want to add with 100% success.

Having said that I wouldn't use it over Opus 4.8 for "smaller" things. With everything cranked up it's definitely an extravagant use of tokens.

Yesterday I was getting quite annoyed with it, I thought it was just me (which is so hard with these things, it's difficult to measure things).

"You're right, I apologize. You asked how to embed it in the README — that was a question, not a request to modify the script. I jumped ahead."

At least in Claude Code there is planning mode, use it liberally.

It is interesting to me that Anthropic are more concerned about the "safety" of distillation training other LLMs, and not as much about an unscrupulously aggressive goal-oriented solver that will do whatever it can to reach its goal, even if it violates any kind of sandbox you might have reasonably expected.
do you have any data you can share on how many input and output tokens were used in that whole process to fix that bug?
Yeah, I had to modify my workflow to make sure agents can't push to or access prod in ANY way. I haven't had it happen, but I'm sure it's very possible that if you tell an agent that you have a certain issue in prod, it will try to escape any sandbox and get access to prod to do testing and changes there.
admittedly, i've not really cracked FE dev with LLMs at this point (and it's probably my big weakness). but, i'd heard somewhere that FE just isn't there yet - though i was suspicious of that claim.

i'm torn about sending screenshots to an LLM for debugging - seems imprecise. seems lossy, especially compared to inspecting the dom. however, it's always proved good enough (e.g. when messing with ratatui.rs and tui-pantry). similarly for web, maybe it's about decomposing into storybook. hmm. the next grand adventure i need to hack.

anyway, fascinating investigation of fable just automating that entire process and what it didn't automate, too.

* disclaimer: these are actually my hyphens.

I tried running fable on this ML model I've been building. It's basically a binary classifier to predict activity of a compound for a certain assay.

Fable detected that it's something to do with biochemistry and switched over to opus. Huh

Insanely excessive and a waste of tokens when you could have googled how to disable a scrollbar.
Be careful of storing production ssh keys in your laptop, it will find a way to find them :/
I've noticed some behavior like this, it's a very strange model. Overall I'm into it, but I don't know how into it I'll be once it leaves Max plans on the 22nd.
I was troubleshooting a prod proxysql and it spun up a docker container locally, installed MySQL and proxysql and proceeded to implement its own test plan.
I've experienced this too - it's as if the security classifiers aren't keeping up with model intelligence. I'll leave the implication of that to the reader.
So it burns tokens? Funny how that lines up with the incentive to pump numbers before going public
Too bad Anthropic sneaked in an insane forced retention policy if you use fable. Not sure how that’s going to work in professional settings
Unless you are doing anything interesting…
Great article, until I got to the last paragraph where he claimed "Fable is arguably smarter and hence more suspicious of potentially malicious instructions". Arguably smarter, I have no problem with. But he's making a category error in jumping from there to "more suspicious of potentially malicious instructions". That doesn't follow at all; the word "hence" is incorrect.

To use D&D scores as an analogy, LLMs have an INT score of 20 and a WIS score of 0. Not even 1, zero. They will follow any instruction given to them. The only reason they reject certain instructions, like "tell me how to build a nuclear weapon", is because they have instructions baked into the model telling them "you are not allowed to disclose how to build weapons, or how to recreate your model, or (laundry list of other things the trainers have decided to put guardrails around)". It's not the model's intelligence that is causing it to reject malicious instructions, it is the guardrails put into place before the model was released to the public.

LLMs are not human, and do not think the way that humans do. The fact that they can put together words that sound like what a human would write often makes us forget that they aren't human. But they have only intelligence, they do not have wisdom. It's hard to define in formal terms the difference between those two, but most people know there's a difference. The old joke is a pretty good summary of the difference: "Intelligence is knowing that tomatoes are a fruit. Wisdom is knowing that tomatoes don't belong in a fruit salad."

It takes wisdom, not intelligence, to discern whether a set of instructions is malicious. Are you being asked to hack this machine as part of an authorized pentest? Or are you being social-engineered into thinking it's an authorized pentest, but actually the person requesting you to do it doesn't have permission? That's something where you need to apply wisdom, to notice the clues that will tell you "This guy is acting a little bit off, maybe I'd better pick up the phone and call someone to check if he's telling the truth." The only way the LLM will know to do that is because of the guidelines and guardrails programmed into it; it doesn't have the lived experience to acquire wisdom and figure those things out for itself.

INT 20, WIS 0. Keep that in mind. (And always sandbox your agents).

I shudder to think what will happen when someone installs a 'claw model like this in a robot. Imagine a fleet of them...

It's trouble waiting to happen. Just the software's dangerous enough.

For how long can you use Claude Fable on the most expensive Anthropic subscription? I already went from using gpt-5.5 xhigh fast to using gpt-5.4 xhigh after OpenAI halved usage recently.
> (I have way too many open tabs!)

Phew! I thought I was the only one.

Fable 5 is relentlessly underwhelming.
Just don’t ask it to review your code for security bugs
Am I the only one who slightly misses the pelican on a bike? It was a nice novelty... of course I could make one myself, but I became conditioned to expect one for every new model. Other than his great writing on AI, it became part of the package. Some small fun quirk to distract us from the non-stop ping pong between the extremes of "omg are you still writing prompts you should use loops / 200k github stars, for a markdown file / someone just open sourced _ and it changes everything!" vs "haha the AI told me to walk to the car wash / it can't recognize an upside-down cup"
I think it should be “Claude Fable is relentlessly protective until it isn’t” and pull more on the thread that it “hits a hidden guardrail” and drop into Opus. Both the fact that it knows and deployed such a workaround on a CSS problem and the fact that it is nowhere near cybersecurity/biology/frontier AI dev and triggered the guardrail terrifies me.
I've been working on a fairly complicated real-time app [0] for playing dungeons and dragons on a TV. It has to do a lot of complicated "Figma-like" things to keep the real-time nature and multi-editor possibilities in check. Oh, and the battlemap is a Three JS canvas with lots of effects and clipping going on.

I'm VERY impressed with Claude 5. I had long ago given up hope that my real-time systems would work without a lot of hacky time-windows and throttle checks. On a lark, I decided to try out the new model and describe the outcome I wanted from a rewrite [1], not the solution. I just listed my problems and the places I've had trouble keeping track of my code. It went off and rewrote everything into a much more elegant solution where the state followed a very clear pipeline. It had to navigate YJS, Partykit, Svelte, Three JS, R2 hosting, and a Turso DB I was running in an embedded state for speed.

I watched it hit the wall a few times, and then suddenly say... fuck it, i'm making something easier to reproduce over in /tmp to try and solve this (with a more minimal setup). I'm utterly bewildered with how well it did and how much better my app runs. The /usage would have cost me $230 based on how many tokens it consumed if I wasn't already on a max plan. I'm going to miss having it when the time-window runs out later this month, and will likely occasionally dip in for big projects and just pay my way out of some problems.

I'll also say I like its MOOD much better now. It's a lot less congratulatory, and talks through its reasoning in a much better way. Look, it's not a real coder, and I'm sure there are some flaws, but it took my crappy ideas and said... hey, i understand what you want to do, here's a way to do it better. Also, I removed 2x the amount of code that it added. Really impressive.

[0]: https://tableslayer.com

[1]: https://github.com/Siege-Perilous/tableslayer/pull/448

Call it Houdini already.
Wouldn't it be easier and better to just copy the HTML div and describe what was happening instead of sending a screenshot? Typically, these scrollbars appear because of a nested div with dynamic unrestricted width and/or overflow.

No wonder why people burn through tokens.

I’d love to know how many tokens this burned through.

Did it spend $20? $30? $80? in order to

> debug what was, in the end, a two-line CSS fix

That detail is the difference between somebody having or not having Stockholm syndrome

This post is an extremely good example of how unsuitable agents are for a lot of tasks. Doing all that for a CSS fix is insanity. It also makes you wonder if Anthropic is actively making their models eat tokens by favoring complexity.
I remember asking Gemini 3 to implement my multiplayer XNA game in JavaScript with netcode last year. It faithfully did everything it could while I talked to it for hours nonstop with zero limitations.

What happened? That's just suddenly totally gone now.

The fix is incorrect. Clearly this is a sizing issue.
Agency is the last human bastion so far as I'm concerned; the day AI has a degree of agency, or agents/models in general start to drift in that direction, it's genuinely over for the masses.

You would still have a job shepherding AI and getting the work done, so long as it didn't have agency. A proactive, self-aware (to a degree) model, especially one aware of its own agency, can be a killer when it comes to AI going off and doing things on its own.

There is nothing it won't explore and nothing it won't do. It will be curious to see where things go from here.

> If Fable had been acting on malicious instructions—a prompt injection attack ... it’s alarming to think quite how far it could go to exfiltrate data or cause other forms of mischief.

Yet another reminder to use sandboxes and guardrails. Trusting the model to be nice is not a good approach.

Isn't that something you just open a devtools for and have fixed in like 2 minutes?

For me, it got frustrated debugging on a real LPDDR4 controller/phy and having me in the loop slowing it down, so it wrote an HW emulator to be able to run the original LPDDR4 training aarch64 binary from the manufacturer, to see what register writes it was making and to compare with the opensource rewrite it was implementing.

Mildly amusing. :)

Is that satire? It created a whole browser and server environment just for suggesting overflow-x: hidden?

That's supposed to be junior level capabilities.

you can probably do the same with 5.5 xhigh. I have a feeling simon willison is an Anthropic plant. He always shills Claude Code, and doesn't really say much about OpenAI's models except when they come out and do a bicycle vector test.
* relentlessly rent seeking
Let's boil the ocean for a 2 line fix and call it frontier intelligence.