Get Shit Done: A meta-prompting, context engineering and spec-driven dev system
https://github.com/gsd-build/get-shit-doneIt's working, and I'm enjoying how productive it is, but it feels like a step on a journey rather than the actual destination. I'm looking forward to seeing where this journey ends up.
I reviewed the code from both and the GSD code was definitely written with the rest of the project and possibilities in mind, while the Claude Plan was just enough for the MVP.
I can see both having their pros and cons depending on your workflow and size of the task.
It is perhaps confirmation bias on my part but I've been finding it's doing a better job with similar problems than I was getting with base plan mode. I've been attributing this to its multiple layers of cross checks and self-reviews. Yes, I could do that by hand of course, but I find superpowers is automating what I was already trying to accomplish in this regard.
A mess. I still enjoy superpowers brainstorming but will pull the chute towards the end and then deliver myself.
If I fork out a version for others that is public, then I have to maintain that variation as well.
Is anyone in a similar situation? I think most of the ones I see released are not particularly complex compraed to my system, but at the same time I don't know how to convey how to use my system as someone who just uses it alone.
it feels like I don't want anyone to run my system, I just want people to point their ai system to mine and ask it what there is valuable to potentially add to their own system.
I don't want to maintain one for people. I don't want to market it as some magic cure. Just show patterns that others can use.
My findings:
1. The spec created by Superpowers was very detailed (described the specific fonts, color palette), included the exact content of config files, commit messages etc. But it missed a lot of things like analytics, RSS feed etc.
2. Superpowers wrote the spec and plan as two separate documents which was better than the collaborative method, which put both into one document.
3. Superpowers recommended an in-place migration of the blog whereas the collaborative spec suggested a parallel branch so that Hugo and Astro can co-exist until everything is stable.
And a few more difference written in [0].
In general, I liked the aspect of developing the spec through discussion rather than one-shotting it, it let me add things to the spec as I remember them. It felt like a more iterative discovery process vs. you need to get everything right the first time. That might just be a personal preference though.
At the end of this exercise, I asked Claude to review both specs in detail, it found a few things that both specs missed (SEO, rollback plan etc.) and made a final spec that consolidates everything.
Another great technique is to use one of these structures in a repo, then task your AI with overhauling the framework using best practices for whatever your target project is. It works great for creative writing, humanizing, songwriting, technical/scientific domains, and so on. In conjunction with agents, these are excellent to have.
I think they're going to be a temporary thing - a hack that boosts utility for a few model releases until there's sufficient successful use cases in the training data that models can just do this sort of thing really well without all the extra prompting.
These are fun to use.
Get Shit Done is best when when you're an influencer and need to create a Potemkin SaaS overnight for tomorrow's TikTok posts.
It's hard to say why GSD worked so much better for us than other similar frameworks, because the underlying models also improved considerably during the same period. What is clear is that it's a huge productivity boost over vanilla Claude Code.
Yes this is how much paying FreshBooks annoyed me. Plus I hated they forced an emailed 2FA if you didn’t connect with Google.
The best way I have today is to start with a project requirements document and then ask for a step-by-step implementation plan, and then go do the thing at each step but only after I greenlight the strategy of the current step. I also specify minimal, modular, and functional stateless code.
I started with all the standard spec flow and as I got more confident and opinionated I simplified it to my liking.
I think the point of any spec driven framework is that you want to eventually own the workflow yourself, so that you can constraint code generation on your own terms.
I think these type of systems (gsd/superpowers) are way too opinionated.
It's not that they can't or don't work. I just think that the best way to truly stay on top of the crazy pace of changes is to not attach yourself to super opinionated workflows like these.
I'm building an orchestrator library on top of openspec for that reason.
https://zarar.dev/spec-driven-development-from-vibe-coding-t...
For this reason I don’t think it’s actually a good name. It should be called planning-shit instead. Since that’s seemingly 80%+ of what I did while interacting with this tool. And when it came to getting things done, I didn’t need this at all, and the plans were just alright.
But what makes a difference is running plan review and work review agent, they fix issues before and after work. Both pull their weight but the most surprising is the plan-review one. The work review judge reliably finds bugs to fix, but not as surprising in its insights. But they should run from separate subagents not main one because they need a fresh perspective.
Other things that matter are 1. testing enforcement, 2. cross task project memory. My implementation for memory is a combination of capturing user messages with a hook, append only log, and keeping a compressed memory state of the project, which gets read before work and updated after each task.
[1] https://www.riaanzoetmulder.com/articles/ai-assisted-program...
This is the real challenge. The people I know that jump around to new tools have a tough time explaining what they want, and thus how new tool is better than last tool.
[1] https://github.com/ChristopherKahler/paul
[2] https://github.com/ChristopherKahler/paul/blob/main/PAUL-VS-...
If it was game engine or new web framework for example there would be demos or example projects linked somewhere.
I'm facing increasing pressure from senior executives who think we can avoid the $$$ B2B SaaS by using AI to vibe code a custom solution. I love the idea of experimenting with this but am horrified by the first-ever-case being a production system that is critical to the annual strategic plan. :-/
I've been poking at security issues in AI-generated repos and it's the same thing: more generation means less review. Not just logic — checking what's in your .env, whether API routes have auth middleware, whether debug endpoints made it to prod.
You can move that fast. But "review" means something different now. Humans make human mistakes. AI writes clean-looking code that ships with hardcoded credentials because some template had them and nobody caught it.
All these frameworks are racing to generate faster. Nobody's solving the verification side at that speed.
Saying "I generated 250k lines" is like saying "I used 2500 gallons of gas". Cool, nice expense, but where did you get? Because it it's three miles, you're just burning money.
250k lines is roughly SQLite or Redis in project size. Do you have SQLite-maintaining money? Did you get as far as Redis did in outcomes?
Rather than having agents decide to manage their own code lifecycle, define a state machine where code moves from agent to agent and isolated agents critique each others code until the code produced is excellent quality.
This is still a bit of an token hungry solution, but it seems to be working reasonably well so far and I'm actively refining it as I build.
Not going to give you formal verification, but might be worth looking into strategies like this.
We built AI code generation tools, and suddenly the bottleneck became code review. People built AI code reviewers, but none of the ones I've tried are all that useful - usually, by the time the code hits a PR, the issues are so large that an AI reviewer is too late.
I think the solution is to push review closer to the point of code generation, catch any issues early, and course-correct appropriately, rather than waiting until an entire change has been vibe-coded.
My rant about this: https://sibylline.dev/articles/2026-01-27-stop-orchestrating...
Things have changed quite a bit. I hope you give GSD a try yourself.
It absolutely tore through tokens though. I don't normally hit my session limits, but hit the 5-hour limits in ~30 minutes and my weekly limits by Tuesday with GSD.
Makes sense for consistency, but also shifts the problem:
how do you keep those artifacts in sync with the actual codebase over time?
I've been down the "don't read the code" path and I can say it leads nowhere good.
I am perhaps talking my own book here, but I'd like to see more tools that brag about "shipped N real features to production" or "solved Y problem in large-10-year-old-codebase"
I'm not saying that coding agents can't do these things and such tools don't exist, I'm just afraid that counting 100k+ LOC that the author didn't read kind of fuels the "this is all hype-slop" argument rather than helping people discover the ways that coding agents can solve real and valuable problems.
Like most spec driven development tools, GSD works well for greenfield or first few rounds of “compound engineering.” However, like all others, the project gets too big and GSD can’t manage to deliver working code reliably.
Agents working GSD plans will start leaving orphans all over, it won’t wire them up properly because verification stages use simple lexical tools to search code for implementation facts. I tried giving GSD some ast aware tools but good luck getting Claude to reliably use them.
Ultimately I put GSD back on the shelf and developed my own “property graph” based planner that is closer to Claude “plan mode” but the design SOT is structured properties and not markdown. My system will generate docs from the graph as user docs. Agents only get tasked as my “graph” closes nodes and re-sorts around invariants, then agents are tasked directly.
I think I have to get my head around a lot more than I think
Looked at profile, hasn't done or published anything interesting other than promoting products to "get stuff done"
This is like the TODO list book gurus writing about productivity
I got a promotion once for deleting 250K lines of code in less than a month. Now that sounds better
Faster than using ai. Cheaper. Code is better tested/more secure. I can learn/build with other humans.
There is a gsd-plan-checker that runs before execution, but it only verifies logical completeness — requirement coverage, dependency graphs, context budget. It never looks at what commands will actually run. So if the planner generates something destructive, the plan-checker won't catch it because that's not what it checks for. The gsd-verifier runs after execution, checking whether the goal was achieved, not whether anything bad happened along the way. In /gsd:autonomous this chains across all remaining phases unattended.
The granular permissions fallback in the README only covers safe reads and git ops — but the executor needs way more than that to actually function. Feels like there should be a permission profile scoped to what GSD actually needs without going full skip.
That's not a reason to stop trying. This is the iterative process of figuring out what works.
If I remember correctly, it created a lot of changes, spent a lot of time doing something and in the end this was all smoke and mirrors. If I would ever use something like this, I would maybe use BMad, which suffers from same issues, like Speckit and others.
I don't know if they have some sponsorship with bunch of youtubers who are raving how awesome this is... without any supporting evidence.
Anyhow, this is my experience. Superpowers on the other hand were quite useful so far, but I didn't use them enough to have to claim anything.