Everything is LLMs these days. LLMs this, LLMs that. Am I really missing out on something with these muted models? Back when they were released, they were so much more capable, but now everything is muted to the point that they are mostly autocomplete on steroids.

How can adding analytics to a system that is designed to act like humans produce anything good? What is the goal here? Could you clarify why someone would need to analyze LLMs, of all things?

> Rich text data makes LLM traces unique, so we let you track “semantic metrics” (like what your AI agent is actually saying) and connect those metrics to where they happen in the trace

But why does it matter? Because in their current state these are muted LLMs overseen by big companies. We have very little control over their behavior, and whatever we give them, the output will mostly be 'politically' correct.

> One thing missing from all LLM observability platforms right now is an adequate search over traces.

Again, why do we need to evaluate LLMs? Unless you are working in security, I see no purpose, because these models aren't as capable as they used to be. Everything is muted.

For context: I don't even need to prompt engineer these days, because the default prompt gives similar results. My prompts these days are literally three words, because that gets more of the job done than an elaborate prompt with precise examples and context.

They're not "muted". You just got used to them and figured out that they don't actually generate new knowledge or information; they only give a statistically average summary of the top Google query. (I.e., they are super bland, boring, and predictable.)
I found a LOT more value from personal Python-based API tools once I employed well-described JSON schemas.

One of my clients must comply with a cyber risk framework with ~350 security requirements, many of which are so poorly written that misinterpretation is both common and costly.

But there are other, better-written and better-described frameworks that include "mappings" between themselves and the vague one.

In the past I would take one of the vague security requirements and read its mapping to the well-described framework to understand the underlying risk, the intent of the question, and the likely mitigating measures (security controls). On average, that took 45-60 minutes per question. Multiplied out, that's ~350 * 45 minutes, or around 262 hours.

My first attempts to use AI for this yielded results that had some value, but lacked the quality to provide to the client.

This past weekend, using Python, Sonnet 3.5, and JSON schemas, I managed to get all ~350 questions documented at a quality level exceeding what I could achieve manually.

It cost $10 in API credits and approx. 14 hours of my time (I'm sure a pro could easily achieve this in under an hour). The code itself was easy enough; the big improvements came from the schema descriptions. That was the change that gave me the 'aha' moment.
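
For concreteness, here is a minimal sketch of that pattern using Anthropic's tool-use API to force schema-conforming output. The schema fields, their descriptions, and the requirement text are all invented for illustration; the point is that the per-field descriptions carry the domain knowledge.

    import anthropic

    # One of the ~350 requirements (invented example).
    requirement_text = "Backups must be adequately protected."

    # The descriptions do the heavy lifting: they tell the model how to
    # interpret a vague requirement, field by field.
    schema = {
        "type": "object",
        "properties": {
            "underlying_risk": {
                "type": "string",
                "description": "The risk this requirement mitigates, stated "
                               "plainly, e.g. 'unauthorised access to backups'.",
            },
            "intent": {
                "type": "string",
                "description": "What the framework author wants the assessor "
                               "to verify, independent of the vague wording.",
            },
            "mitigating_controls": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Concrete security controls that would satisfy "
                               "the requirement, one per element.",
            },
        },
        "required": ["underlying_risk", "intent", "mitigating_controls"],
    }

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        tools=[{
            "name": "document_requirement",
            "description": "Document one security requirement.",
            "input_schema": schema,
        }],
        # Forcing the tool guarantees the reply is a schema-shaped dict.
        tool_choice={"type": "tool", "name": "document_requirement"},
        messages=[{"role": "user", "content": f"Requirement: {requirement_text}"}],
    )
    answer = resp.content[0].input  # dict matching the schema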

I read over the final results for dangerous errors (but ended up changing nothing at all), and just in case, I ran the results through GPT-4o, which also found no issues that would prevent sending them to the client.

I would never have gotten that job done manually; it's simply too much of a grind for a human to do cheaply or reliably.

Have you tried BAML (https://github.com/boundaryml/baml)? It's really good at structured output parsing. We integrated it directly into our pipeline builder.
Not yet, but the weekend is just beginning. Thanks for the tip.
(BAML founder here) feel free to jump on our Discord or email us if you have any issues with BAML! Here's our repo (with docs links) https://github.com/BoundaryML/baml and a demo: https://boundaryml.wistia.com/medias/5fxpquglde

People have used it to do anything from simple classifications to extracting giant schemas.

You are welcome! The easiest way to get started with BAML on Laminar is with our pipeline builder and Structured Output template. Check out the docs here (https://docs.lmnr.ai/pipeline/introduction)
Hey there, apologies for the late reply.

> Could you clarify why someone would need to analyze LLMs, of all things?

When you want to understand trends in the output of your agent / RAG at scale, without manually looking at each trace, you need another LLM to process the output. For instance, say you want to understand the most common topic discussed with your agent. You can prompt another LLM to extract this info; Laminar will host everything and turn this data into metrics.
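
The generic pattern looks something like the sketch below (this is not Laminar's actual API; the topic labels, model choice, and prompt are made up for illustration): a second LLM labels each trace output, and the labels are aggregated into a metric.

    from collections import Counter

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def label_topic(agent_output: str) -> str:
        """Ask a second LLM to assign one topic label to an agent output."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": "Reply with exactly one topic label for the text: "
                            "billing, onboarding, bug_report, or other."},
                {"role": "user", "content": agent_output},
            ],
        )
        return resp.choices[0].message.content.strip().lower()

    # Trace outputs would normally come from your observability store.
    traces = ["How do I update my card?", "The export button crashes the app."]
    topic_counts = Counter(label_topic(t) for t in traces)
    print(topic_counts.most_common())  # e.g. [('billing', 1), ('bug_report', 1)]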

> Why do we need to evaluate LLMs?

You're right: devs who want to evaluate the output of their LLM apps truly care about quality or some other metric, and for those kinds of cases evals are invaluable. Good examples would be AI drive-through agents or AI voice agents for mortgages (both use cases we've seen on Laminar).
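
A hand-rolled version of such an eval might look like the sketch below (the dataset, judge prompt, and run_agent stub are all illustrative; platforms like Laminar productize this loop):

    from openai import OpenAI

    client = OpenAI()

    dataset = [
        {"input": "Large fries and a coke",
         "expected": "order: fries (large), coke (regular)"},
    ]

    def run_agent(user_input: str) -> str:
        # Stand-in for the agent under test.
        return "order: fries (large), coke (regular)"

    def judge(expected: str, actual: str) -> bool:
        """LLM-as-judge: does the actual answer match the expected one?"""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Expected: {expected}\nActual: {actual}\n"
                           "Does the actual answer convey the same order? "
                           "Reply yes or no.",
            }],
        )
        return resp.choices[0].message.content.strip().lower().startswith("yes")

    passed = sum(judge(row["expected"], run_agent(row["input"])) for row in dataset)
    print(f"{passed}/{len(dataset)} passed")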

Topic modelling and classification are real problems in LLM observability and evaluation; glad to see a platform tackling this.

I see that you have chained prompts; does that mean I can define agents and functions inside the platform without having them in code?

Yes! Our pipeline builder is pretty versatile. You can define conditional routing, parallel branches, and cycles. Right now we support an LLM node and util nodes (e.g., a JSON extractor). If you can define your logic purely with those nodes (and in the majority of cases you can), then great, you can host everything on Laminar! Follow this guide (https://docs.lmnr.ai/tutorials/control-flow-with-LLM); it's a bit outdated but gives you a good idea of how to create and run pipelines. See also the sketch below.
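
To make that concrete, here is the same control-flow idea sketched in plain Python rather than in Laminar's pipeline builder (the prompts, routing labels, and model are invented for illustration): an LLM node classifies the input, a conditional routes between two branches, and a JSON-extractor util parses the final output.

    import json

    from openai import OpenAI

    client = OpenAI()

    def llm(prompt: str) -> str:
        # A single "LLM node": one prompt in, one completion out.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def pipeline(user_input: str) -> dict:
        # Conditional routing: classify, then branch.
        route = llm(f"Classify as 'question' or 'complaint': {user_input}")
        if "complaint" in route.lower():
            raw = llm(f"Draft an apology as JSON with key 'reply': {user_input}")
        else:
            raw = llm(f"Answer as JSON with key 'reply': {user_input}")
        # "JSON extractor" util node: pull the first JSON object out of the text.
        start, end = raw.find("{"), raw.rfind("}") + 1
        return json.loads(raw[start:end])
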
> Everything is LLMs these days. LLMs this, LLMs that. Am I really missing out on something with these muted models? Back when they were released, they were so much more capable, but now everything is muted to the point that they are mostly autocomplete on steroids.

That was my experience too, until I tried out that Cursor thing, and it turns out a well-designed UX around Claude 3.5 is the bee's knees. It really does work; highly recommend the free trial. YMMV of course, depending on what you work on; I tested it strictly on Python.

You're thinking about consumer use cases. Commercial use cases are not "muted" by any means. The goal is to produce domain-specific JSON when fed some contextual data, and LLMs have only gotten better at that over time.