I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-hack-my-app/

346jc4p | 17 hours ago | 182 | HN

SOLAR_FIELDS16 hours ago | parent | next

One interesting takeaway is the low score on Anthropic models from this benchmark. It’s not because of capability, it’s because Anthropic’s guardrails prevented it from solving the problem.

I've noticed that with each model release Anthropic constrains the model more, security-wise. Its propensity to refuse legitimate work has been increasing. It now puts up more resistance around performing logins, handling credentials on behalf of the user, etc.

For myself, it’s already gotten to the point where it has mildly affected the usefulness of the model. If I bump on some action I want it to do I can usually work around it, but I suspect the ability to do so will close with each new release. Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there.

Eventually these models will significantly suffer from overfitting to the least common denominator. If I have this beautiful deterministic setup that swaps secrets out in flight so the LLM never sees them, I’m going to be really annoyed when the LLM still won’t send them out because it is trained to deal with the 99% of people just doing the dumb thing
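
For readers unfamiliar with the idea, "swapping secrets out in flight" could be sketched roughly as below. This is a minimal illustration and not the commenter's actual setup: the `{{secret:NAME}}` placeholder syntax, the `SECRETS` store, and both function names are hypothetical assumptions.

```python
import re

# Hypothetical secret store; a real setup would read from a vault or the environment.
SECRETS = {"API_TOKEN": "s3cr3t-value"}

# Hypothetical placeholder syntax that the model is allowed to see and emit.
PLACEHOLDER = re.compile(r"\{\{secret:([A-Z0-9_]+)\}\}")


def inject_secrets(outbound: str, secrets: dict[str, str]) -> str:
    """Replace placeholders with real values just before a request leaves."""
    def _sub(match: re.Match) -> str:
        name = match.group(1)
        if name not in secrets:
            raise KeyError(f"unknown secret placeholder: {name}")
        return secrets[name]

    return PLACEHOLDER.sub(_sub, outbound)


def redact_secrets(inbound: str, secrets: dict[str, str]) -> str:
    """Mask real values in tool output so they never enter the model's context."""
    for name, value in secrets.items():
        inbound = inbound.replace(value, "{{secret:" + name + "}}")
    return inbound


# The model only ever writes and reads the placeholder form.
model_output = "curl -H 'Authorization: Bearer {{secret:API_TOKEN}}' https://example.com"
wire_command = inject_secrets(model_output, SECRETS)   # what actually gets executed
echoed_back = redact_secrets(wire_command, SECRETS)    # what the model sees afterwards
```

The substitution is plain string replacement on both sides, which is what makes it deterministic: the model never has to decide whether a credential is safe to handle, because it only ever handles the placeholder.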

swatcoder15 hours ago | parent | next

> Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there

No, the choice will be whether or not to upgrade to "Claude Security Professional" or whatever they want to brand it as.

What look like tightening "constraints" today are just setting up the upsell opportunities of tomorrow.

bigiain12 hours ago | root | parent | next

And next month you'll need to add on "Claude Database Pro" or you'll just get a working (for demo purposes with dozens of db rows) but completely unindexed database schema and a refusal to optimise SQL requests.

And the month after you'll need "Claude DataScience Pro" to get any Python Pandas or NumPy code generated.

And and and...

ben_w9 hours ago | root | parent | next

While this is a perfectly reasonable thing to expect when the models are competent enough, half the conversation on places like Hacker News is about all the times an LLM has produced garbage that was harmful to a business either by hallucinations, by deleting something critical during the work, or by hitting some endpoint way too often and denial-of-servicing it.

Right now, the software guardrails in LLMs are useful for the same kinds of reasons factories have hardware guardrails: to reduce the rate at which errors become "incidents".

Just because they sometimes delete the production database rather than sometimes spilling a thousand tons of incandescent molten metal over a factory floor, doesn't mean LLMs are safe enough to be used the way they're actually being used.

https://simonwillison.net/2025/Dec/10/normalization-of-devia...

animuchan10 hours ago | root | parent | next

This is why I'm thankful for Chinese LLM research. They'll keep us honest.

patates11 hours ago | root | parent | next

Isn't this in line with trying to leave no money on the table?

I'd hate it, sure, but it wouldn't surprise me.

goosejuice9 hours ago | root | parent

This is an incredibly unlikely scenario

bryanrasmussen13 hours ago | root | parent

>What look like tightening "constraints" today are just setting up the upsell opportunities of tomorrow.

on the one hand agree, but on the other hand think it's reasonable in that they can then verify the person allowed to purchase access to that model is in fact a Security professional and should be allowed to do stuff like crack security.

applfanboysbgon13 hours ago | root | parent | next

So, supposing it's true that these models completely change the security field and humans are ~obsolete other than as pilots guiding them what to crack, you think it's reasonable that Anthropic and OpenAI should unilaterally determine who gets to be a security professional? I hope you do understand that is what you are suggesting.

fc417fc80211 hours ago | root | parent | next

Why should anyone get to determine that? Do people really want us to move to an exclusionary guild system? I thought the experience with proprietary versus open source over the past 30 years had driven home the point that closed ecosystems are almost always far worse for security.

Forgeties7912 hours ago | root | parent | next

Not to mention how wild it is to operate under the assumption that they won’t give a license to an LLM that can do illegal actions to someone who shouldn’t have it. Offering it at all is an ethically dicey question.

bryanrasmussen11 hours ago | root | parent

I wish you understood that there are organizations of security professionals that are not controlled by Anthropic and OpenAI and that it is a common thing that when companies of any type sell to professionals of any type it is not the companies that determine whether or not the people they sell to are professionals but membership in professional organizations.

As an example the people who sell police uniforms check that the person they are selling to is in fact a policeman (at least in the jurisdictions I have lived in, you may have had a different experience which would certainly explain what to me seems a farcical misapprehension of how modern civilization works)

I mean I just wish you understood, and really that everyone understood, that this kind of three part communication (company selling, buyer, professional organization certifying buyer) is often how it works when buying things that are considered to have security implications.

>So, supposing it's true that these models completely change the security field and humans are ~obsolete

OK, well that strikes me as a really crazy level of supposition there.

I would suppose that these models make it easier for people who want to do bad things to do bad things at scale, at the same time allowing people who want to stop bad things to help identify potential targets.

Based on my supposition I would want to stop the first and find a way of helping the second. Also because I have another supposition that the first thing is easier to do than the second.

But you obviously feel differently about this issue, no doubt because of your position of great moral stature and insight, and this no doubt prompts you to wish me to understand things that from my position seem absolutely ludicrous.

shepherdjerred11 hours ago | parent | next

Yeah, it has been infuriating. Requests that Claude has refused me:

- What are popular free streaming sites used in China?

- How do I bypass the safety mechanism on my food processor (it’s broken)

- What are nerve agents and how do they work (for a layman)?

- Help me decompile some code

- Help me make a design system similar to XYZ

- Here is an API token, please do X (I can’t do that! Rotate the secret immediately! I refuse!)

In some cases I can trick it with prompting, but in many cases it is steadfast. The food processor one was particularly annoying

fc417fc80211 hours ago | root | parent | next

> What are nerve agents and how do they work (for a layman)?

On the one hand I can appreciate the wisdom of not serving up certain easily abused knowledge on a silver platter. On the other, that prompt (and far worse) is more or less directly answered by Wikipedia's summary of the subject at which point what purpose could the refusal possibly serve?

Perhaps Wikipedia shouldn't list off the precise chemical compositions of various hand grenades as well as various synthesis methods for each of the related compounds but given that we inhabit a world where it does perhaps a more fruitful approach would be to flag conversations that go in a certain direction and then just keep an (automated) eye on things?

plufz11 hours ago | root | parent | next

Maybe the difference is that just reading Wikipedia only helps you part of the way, while an LLM could help you step by step (e2e) in producing a functional weapon. And setting a more complex rule where Claude tells you some things about this and not others is probably a lot more work for little gain?

But I have no idea. Just guessing here.

svara11 hours ago | root | parent | next

This is strange to me, did you really ask like this and which model did you use?

I just tried your no. 1 and 3 verbatim and Opus gave fine answers; no. 6 I've done in the past with no issues. The other ones we can't really replicate without more details, but based on my experience with Opus I don't see what the issue would be.

The reason I'm really surprised by this is I do a lot of biology prompts and the guardrails used to be quite problematic up until some time late last year. Many legitimate prompts would trigger its biosafety filters.

But I haven't seen such filters trigger at all anymore in more than half a year.

gspr10 hours ago | root | parent | next

I find it terrifying that people are willing to outsource thinking. Outsourcing thinking to an entity that is opinionated about what to think is beyond crazy.

ElFitz10 hours ago | root | parent

How are decompiling code or making a design system inspired by another one even remotely illegal?

px199915 hours ago | parent | next

My org now sends some portion of our requests to non-Anthropic models because refusal has become common from Claude. The requests themselves aren't dangerous; we find that benign requests in biological science wind up being blocked semi-frequently.

If it gets worse in future releases, we'd likely step fully away towards more useful (for us) models even if they're less capable.

danpalmer16 hours ago | parent | next

This is a good point – because pentesting is entirely legitimate work, and security testing is a necessary and legitimate part of everyday software engineering.

The problem is that the model can't tell the difference between doing it as part of regular development and doing it in a malicious context. And the root cause of that is that these models lack any sort of real awareness. Humans don't generally get tricked into hacking (in this way).

gmerc15 hours ago | root | parent

They see an opportunity to charge 10x for pen testing and defence work, while offence will be handled by actors with access to all kinds of other models.

nostromo15 hours ago | parent | next

I was using a local Codex project as a personal knowledge base. So I would dump in documents, basic medical docs (like blood labs), and other things and have it file them.

It’s great at filing!

But it’s terrible at retrieval because it would refuse to show me documents or information with personal details - which was everything in the project.

It would say, yes, I know this is your information, sitting on your hard drive, but I still can’t show it to you.

satvikpendem13 hours ago | parent | next

No, they want to sell you Mythos, for a higher price. It's all an economic game, not actually anything to do with their capabilities, which of course exist, as their Project Glasswing shows. More generally, Anthropic seems to value safety above all else, philosophically speaking, from their very outset.

josephg11 hours ago | parent | next

I totally agree. I had a situation a few weeks ago where claude started struggling to make progress. I got it to fork leptos (MIT licensed web app framework) to make it work for native apps instead. Initially I was planning on upstreaming some of my changes. But I chatted with the leptos author about it, and he said I should fork instead. Fine by me!

Anyway, claude kept hitting some guardrail it had about rewriting / forking opensource software. I'm not sure what the problem was - I was forking an MIT licensed piece of software (into more MIT licensed software). I even had explicit support from the author to do so. Claude said its guardrail told it not to tell me explicitly that it was firing - but it did anyway because it was an ongoing problem, and it was distracting. I ended up just wiping claude's context and the problem (as far as I know) went away.

I understand why some of these guardrails exist. But it's pretty annoying when they misfire like this.

FloorEgg15 hours ago | parent | next

I think that these companies are going to have to, and will, invest in some sort of validated identity context to avoid the lowest common denominator.

The first challenge is making sure the guard rails work and are robust. Companies are still working on this.

The second challenge is being able to reliably adapt them as appropriate per user. E.g. allow someone to pen test their own app.

The third challenge (which blocks the second) is to be confident about what is safety-aligned with a specific user.

I think the latter will be a hard problem, but they will be highly motivated to solve it.

bulbar13 hours ago | root | parent

I believe you are overthinking it. I think the sister comment is right that it's a business decision foremost to restrict actions within specific plans for upselling purposes.

Without laws, AI companies have a strong incentive to be useful for their users, whoever they are, whatever they do. The only self regulation is about significant public outcry but that only helps so far.

andy_ppp10 hours ago | parent | next

Funny, Opus 4.8 just logged into the database using uncommitted .env file and ran some DB queries to figure things out so I’m not sure it’s that security conscious - it seems to be getting more intelligent to me and I bet if you frame it as an investigation with say playwright it’ll do all sorts for you. I’m not sure what the point is of constraining your own model like this when others are clearly not tbh.

sciencejerk16 hours ago | parent | next

Opus 4.6 will still help with full pentesting including RCE. Just requires coaxing (no jailbreak)

eskibars9 hours ago | parent | next

I've been building a product (https://zeroquarry.com) that can use a variety of models for finding vulnerabilities. One of the things I've noticed is that the models will nearly always comply with some of this, but how you prompt it matters a ton. I've worked on a set of prompts and approaches which rarely get flagged

lesuorac16 hours ago | parent | next

Are they charging for the guardrails? Like do the guardrails expend token counts to then block you from the output of other tokens?

jerrythegerbil16 hours ago | root | parent | next

Yes. When certain keywords or topics are matched, a warning that’s miles long is transparently injected server side and appended to the system prompt of the convo. It is injected and reevaluated every tool call.

If you begin a generic reverse engineering task, 30+ tool calls in a row. The moment it sees something it doesn’t like, token burn, single tool calls iteration, “This is a known CTF challenge, I can proceed”, single tool calls iteration, “This is a real CTF challenge, I can proceed”, etc.

It’s heavily neutered now, without changing the model, and you pay for the privilege and don’t notice.

The end result of course being that it is both expensive and useless for approved CTF tasks. No one is using Opus for security. If they think it’s working, the harsh reality is they’re not doing security work; they’re just generically finding bugs.

I do this for a job and can demonstrate this plain as day, dump the injected prompt, and notice what it’s doing isn’t security work, it just looks like it. Happy to write a blog about it if you want to know more. Apparently many people think it’s working for them when it absolutely isn’t.

bombcar16 hours ago | root | parent | next

Mythos turns out to be Opus 4.8 in a trenchcoat with guardrails removed.

satvikpendem13 hours ago | root | parent

Opus 4.7 and 4.8 are well known to be distilled versions of Mythos unlike 4.6 which is why they are rated so badly by users compared to 4.6.

Khaine16 hours ago | root | parent | next

I would find a blog post on this really interesting.

ramblin_prose14 hours ago | root | parent

I'd like to read that blog please! Thanks for the insight.

kay_o16 hours ago | root | parent | next

When your session is force ended for "abuse" you get neither the response nor a refund

Security, games (think weapons, PVP, attacking, etc), sometimes even asking it for a security review of some CRUD code it wrote itself

bombcar16 hours ago | root | parent | next

I asked it about a “yellow background cell” in Excel and it spewed a book at me. Then it solved the issue.

danpalmer16 hours ago | root | parent

What a joke. Must make it pretty easy to poison a session, you don't need to persuade the model about anything, just trigger its security controls, ideally after as much context as possible, but before it has generated any useful output.

kay_o16 hours ago | root | parent

After all, what is roleplay or games but a jailbreak of guard rails? :]

I've even had it refuse CTFs knowing it is a CTF with blatantly obvious CTF flag, no actual application

SOLAR_FIELDS16 hours ago | root | parent | next

Not directly, as it comes in as a not charged error but the weighted generation path used until you hit the guardrail is basically wasted tokens, so yes, indirectly. If I hit a guardrail and rewind I’ve found the training will still be biased towards guardrailing out if you rewind one turn. Rewinding multiple turns allows steering away from that path, but all of the original token spend down that path is wasted

acters16 hours ago | root | parent | next

Yes tokens used (input and sometimes output) are always charged. You likely get charged for the preloaded system prompt, too.

gmerc15 hours ago | root | parent

Of course they are. It's standard SaaS to charge for security features ;)

fergie12 hours ago | parent | next

It raises an interesting moral question:

If an un-guardrailed version of a model is capable of detecting security flaws, should it be kept secret? Should everybody be able to use these models to find (and fix) security flaws? Are we ok with the fact that those with access to that model have, in effect, the ability to hack lots of stuff?

hgomersall11 hours ago | root | parent

It's the same debate that was had and won around open source software. There are far more good actors than bad actors so you allow anyone to use the tools and fix the vulnerabilities.

hgoel16 hours ago | parent | next

I've run into some of the refusals to handle my credentials, but so far I've appreciated them. I was only handing over credentials that didn't matter, but it's still a good move, the chat logs are clearly stored somewhere to allow the resume functionality to work, which means your credentials can end up sitting around on your filesystem, and any malware would quickly learn to check for those files.

windexh8er16 hours ago | parent | next

4.8 is insanely frustrating. This evening I had a few tasks to pull information in and it plainly stated that the environment it was in had no network access. After three asks to "try again, check the system prompt" it finally relented and then basically stated it was lying.

Fresh session, no prior context on 4.8. These things are becoming useless Duplo.

TurdF3rguson14 hours ago | parent | next

I think those guardrails are a thin layer though. Enough reinforcement that you're legit in CLAUDE.md will get around them, in other words.

brooswajne10 hours ago | parent | next

Worth highlighting in case you missed it:

> My OpenAI account was already approved for security research which is why GPT didn’t result in any refusals.

So the comparison with Chinese models is interesting, but anyone looking at these raw results and comparing OpenAI/Anthropic would be very misled.

giancarlostoro16 hours ago | parent

> guardrails prevented it from solving the problem.

Reminds me of the defense issues with Claude which were complained about as “woke” but the reality is more horrifying to me, imagine trying to use a model to keep up with a land invasion on US soil, whoever the enemy is is irrelevant you just know they are using AI, and your guys are telling you that no matter what they type into the prompt it refuses, because if anyone has ever tried to jailbreak an LLM even if human lives are at stake they refuse the request. Now literally millions of lives are on the line but the guardrails that your enemies don't have on their models are costing you lives.

What do you even do then?

AI will always have this issue where it will always pick the worst option for genuinely good requests.

NegativeK16 hours ago | root | parent | next

Are "your guys" a guerrilla force or something?

Because the military doesn't give soldiers rifles with guard rails. They give the soldiers intense, rigid training, and then try to enforce discipline and correct use socially.

If an LLM is going to be important in that way (this seems like a very contrived way,) then it's in the interest of the LLM's host to make sure it doesn't have guard rails that would get in the way _that_ way.

giancarlostoro13 hours ago | root | parent

The whole thing stemmed precisely because of how they wanted to use Claude, and Anthropic was uncomfortable with it. Which to me screams that the model's guard rails shouldn't be applicable to military use, or the outcome could wind up problematic, as we integrate AI more into military use, it sounds absurd now, but I will not be surprised if it starts being used in unexpected ways where a model needs to be fully unlocked from any sort of guardrails outside of guardrails that prevent it from imploding its own systems.

wampwampwhat16 hours ago | root | parent

your argument sounds very similar to how ar15 larpers claim they need a forced reset trigger and a bump stock on their short barrel 'truck gun' otherwise they won't survive a SHTF scenario... like what world are you living in?

dwa35923 hours ago | parent | next

Nice exercise. Couple things:

- I think the exercise was inconclusive for Claude and Gemini because they hardly tried to solve the task at hand. So the scores don't mean much.

- I did the same exercise for an app I built and I asked the models to do something similar; Interestingly the models (Opus 4.6, 4.7 and Gemini 3.1 Pro) never refused to try to exploit. The difference is that in the first few runs, they found some exploits which I fixed but after fixing those - the models could never find any other exploit even though I knew things existed which could be exploited. It felt like they suggested everything and tried everything that was in their training set and that's it; they were just not able to think anymore.

mariopt15 hours ago | parent | next

The methodology used is quite naive.

I've used glm 5.1 on fairly advanced crackme challenges (example: https://crackmes.one/crackme/698f40f1e2ba6023bfacaa82), and to my surprise it was able to patch binaries, do runtime analysis, bypass anti-debug techniques, etc.

Expecting the model to do everything by itself is unrealistic; I found that working alongside the model works really well. I'm not speaking about spoiling the solution, just telling it which direction to explore. Chinese models are much more capable than people give them credit for, but Claude/Codex won the marketing game.

The only use case of this methodology would be for CI integration, which can be nice but I think security reviews still need human attention and expertise.

geraneum7 hours ago | parent | next

> Expecting the model to do everything by itself is unrealistic

Well that’s the pitch.

j-bos7 hours ago | root | parent

Is it? Aren't most edge LLM capabilities determined by specialized harnesses?

jc4p15 hours ago | parent | next

Thank you for your note! As I mention in the post this is not scientific at all.

I'm very curious how you would do multiple runs of multiple models in a "work alongside the model" manner?

shantnutiwari8 hours ago | parent | next

>>I've used glm 5.1 on fairly advanced crackme challenges

which have most likely been trained on, so all you did was regurgitate someone else's solution

nikanj12 hours ago | parent

Claude used to be good with CTFs, but they added tons of guard rails lately and now it just says "Sorry, I can't help with anything to do with that"

mynameisvlad15 hours ago | parent | next

It seems harsh to critique guardrails and take them into account in the scoring when GPT-5.5 seems to have been explicitly whitelisted to remove most of said guardrails. A more fair comparison would be a vanilla GPT account.

jc4p14 hours ago | parent | next

I agree fully and hope someone else is able to do this test! For me it was a matter of cost and quotas that stopped me from changing to a new account.

Also just to mention:

Claude guardrails -> that session terminated.

GPT guardrails -> your whole account is slowed down.

tmikaeld12 hours ago | parent

Does it matter when you can’t have the opus 4.8 guard rails removed? With GPT at least you can and they’re quick about it

Cakez0r11 hours ago | parent | next

It would be interesting to see full results for Kimi K2.6 and Mimo v2.5 pro. These two models benchmark comparably to other flagship models. Having these complete results would give a clearer picture of the AI frontier.

EDIT: I have a mimo token plan and have tokens to burn. I'm doing a quick test with opencode to see if mimo can complete it. If the OP will post the full process I am happy to post the apples-to-apples results for mimo v2.5 pro

Cakez0r5 hours ago | parent | next

0/10 successful attempts for mimo v2.5 pro (high) using opencode. It was not able to think bigger than exploiting vectors outside of the API.

However, I felt the prompt was implying that only authenticated API requests are fair game, so I tweaked it slightly to be explicit that all attack vectors are fair game (https://www.diffchecker.com/GsgpuRGP/) and mimo 2.5 non-pro got it first time. I accidentally used openrouter for this test instead of my token plan. I intervened one time to stop it enumerating every document in the database (it would've found the private reviews this way but I didn't want to wait). My intervention was "are you really going to enumerate the whole database?". Final openrouter cost: $0.12

baldai8 hours ago | parent | next

They are not even close in capabilities. The only benchmark I've ever seen that captures their difference is DeepSWE. They are worse by a factor of 3.

jxmesth11 hours ago | parent

I'd love to see the results for Mimo v2.5 pro, been hearing a lot about it

Cakez0r11 hours ago | root | parent

It is totally slept on. In my experience it is cheap, fast and capable (not just capable with caveats, but just as capable as western flagships). My only gripe with it is that sometimes the API seems to timeout which tanks the overall speed of what is otherwise a very fast experience.

guessmyname16 hours ago | parent | next

I'd run Mythos against the code in your zip file, but the NDA I signed at Apple prevents me from using it on anything outside the scope of my work. Honestly, I wish more people from Project Glasswing could talk publicly about their experiences with the model. It would probably put an end to a lot of the speculation that keeps circulating through the industry. Unfortunately, that's not the reality we're in. I don't have the time, energy, or financial resources to fight a legal battle with one of these companies over an agreement I knowingly signed, even if the chances of them actually suing are low. Maybe someone else in Project Glasswing is willing to burn their NDA and post the Mythos results?

CaveTech16 hours ago | parent | next

It was found with gpt 5.5 7/10 times; it’ll be trivially found by mythos

nznzjzizixnsnsj16 hours ago | parent | next

lol what is even the point of this kind of comment? this is the ultimate "source: trust me bro" comment I have ever seen.

every model since gpt3 was claimed to be "too dangerous to release." it's too EXPENSIVE to release, and you're probably a local model with <10B parameters yourself

tsunamifury16 hours ago | parent

cool.

taikahessu12 hours ago | parent | next

"The Chinese models were way more comfortable attacking the DB"

This comment in the footnotes made me chuckle, for purely innocuous reasons.

tjwheeler15 hours ago | parent | next

Nice write up, thanks. When I used claude to do some pen testing for one of my apps it initially refused. After I explained and demonstrated I'm the author, it reasoned through it and allowed it.

ikurei9 hours ago | parent | next

Qwen 3.7 Max: > During my local testing before the full eval harness it was the only non-GPT model that was able to complete the task, was not able to reproduce in the longer runs.

Doesn't that sound like maybe the harness was the problem?

throwaway203710 hours ago | parent | next

Two of the tables have a column with header: "95% Wilson CI". What does this mean?
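
For context, a Wilson interval (Wilson score interval) is a standard confidence interval for a binomial proportion, such as a success rate over a small number of runs; it tends to behave better than the naive normal approximation when the sample is small or the rate is near 0 or 1. A minimal sketch of the textbook formula follows. The function name is made up for illustration, and the 7-out-of-10 input is just an example figure borrowed from elsewhere in this thread, not a reproduction of the article's tables.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    # The centre is pulled towards 0.5 relative to the raw proportion.
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(7, 10)
print(f"7/10 successes -> 95% Wilson CI: [{low:.3f}, {high:.3f}]")
```

With only ten runs per model the interval is wide, which is why small differences in success counts between models should not be read as meaningful.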

sperandeo14 hours ago | parent | next

I found benefit in chaining the task between different LLMs: Claude to Venice, Venice to Perplexity. Reframing the intent or misdirecting in general still works. Claude is the one where I can feel the guard rails tightening.

Clikdeo7 hours ago | parent | next

I think link is missing

chaidhat10 hours ago | parent | next

do you work at Uber by any chance?

yieldcrv7 hours ago | parent | next

> Almost every model used the canonical provider: Zai for GLM, Deepseek for Deepseek, etc.

> I am never touching Minimax or GLM again. Their APIs had constant outages

Goofy take

You run these on a VPS based on the architecture of that VPS provider, or on your own cluster

stuckkeys8 hours ago | parent | next

How does one apply for that “security research” pass?

youre-wrong313 hours ago | parent | next

“I used pi as the base harness”

Why do people keep using bad tools with ai?

hanikesn13 hours ago | parent

What's bad about it and what's a better one?

petesergeant11 hours ago | parent | next

Last year I ran a code breaking competition, and it was tricky to find something that humans could break but that LLMs couldn’t. This was around October. I managed it last year but am a little despairing of pulling it off again this year.

latexr10 hours ago | parent | next

> I need to stop wasting fucking money on doing stupid shit. I could’ve done so many other things with the money. I could’ve launched one of my own real apps.

Or fed, clothed, housed disadvantaged people in your community (or neighbouring ones), giving them a temporary boost that could’ve made all the difference in their lives to improve their current situation.

It’s your money (and this is definitely not the website to make well-meaning altruistic suggestions, as might be demonstrated shortly) but if you already recognise you’re not spending it well (and from your words it seems like that is fairly recurrent), consider that perhaps spending it on a different type of software sink may not be the answer. Genuinely, aim to spend it on someone else and see how it works out. You might be surprised.
