- Transcribing scanned documents into formatted text
- Captioning/describing images and classifying them for audience suitability (includes anti-spam)
- Matching documents with relevant Wikipedia pages for tagging
I don't use them like frontier models. I break the work down into micro-tasks with one clear goal for each prompt. I write a lot of glue software to make the complete flow work. I was working on all of these tasks before LLMs appeared on the scene. The LLMs have allowed me to replace a lot of complicated code with less code plus a model, while achieving better results.
I use local models for reasons of cost and control. I already had the workstation and GPU. The only running cost is electricity. I have used proprietary models from OpenAI and Google for some of these tasks, but I also encountered churn when the models I built my tools around were retired. I don't worry about that when I have the weights saved locally.
I saw a little app the other day, I think someone posted on here, that looks at your screenshot and renames the file based off the contents of the file.
There's tons of little examples like that. For a lot of use cases, you really don't need the frontier models.
If you have a very specific idea for local model use you can find a way to make it work very well, you don't even need to have a graphics card or NPU chip. You just have to be extremely constrained in how it's used. I think as a generic chatbot they're not great, I'd use a hosted SOTA model and I'm a big fan of local LLMs myself.
Could you talk a bit how you did the finetuning? Did you use unsloth or any other tool and how went the verification to proof the outcome?
Repo is https://github.com/Rebreda/listenr - mainly geared toward Whisper fine-tuning, AMD hardware and local inference
In practice I haven't got around to building something around multimodality since I'm primarily using their text generation capabilities.
Then when I’m getting close to feature-complete, I’ll move to a hosted frontier model for the final integration.
Cost savings are enormous if you’re making dozens of calls to language models a minute.
I found Gemma 4 to be better, or at least more nuanced, than Gemini 2.5 Flash. And, the new Gemini 3.5 Flash is very good but is unrealistically expensive (ten times more expensive than DeepSeek or MiMo). So, since I don't need extremely fast performance, a self-hosted Gemma 4 wins for a bunch of stuff.
I've also found Qwen 3.6 27B to be shockingly good at finding security bugs for its size. It beats several larger models, and is close to Gemini Pro 3.1 (but Gemini 3.5 Flash surprisingly beats it soundly). Since it only costs electricity, and my electricity is cheap and 100% renewable, I can use it more broadly than I might otherwise use a hosted model.
All that said, the smart money is still on buying the subsidized tokens from the providers that offer them, rather than buying the hardware needed to run models that are 30+GB in size, as all of the ones I've been using regularly are (8-bit quantization, as they get a little dumber for every bit you drop below that). A $100 subscription to Claude or Codex currently provides access to the best models at a heavily discounted rate. And, DeepSeek/MiMo are extremely cheap, one or more orders of magnitude cheaper than the top models from Anthropic or OpenAI, if you need an API for automated usage. I spent about $4000 on my two inference machines (a Strix Halo with 128GB unified RAM, and a new desktop build based around two cheap old 32GB AMD data center GPUs), which buys a lot of tokens for tiny models like this...probably a couple/few years worth. But, I like tinkering, so having an excuse to play with hardware is its own reward. If it happens to pay me back some of that money, that's a bonus.
Of course, as the major providers decide they need to ring the cash register and stop burning money on subsidized tokens, that math may change, and I may find I'm grateful to have already bought this stuff before the RAM prices made everything 2-3x more expensive.
But, I think if you're not interested in learning about the technology and doing your own training experiments and such, you should probably not try to run stuff locally most of the time.
Wow LLMs are changing the world, what a utopia.
Think almost like unix pipelines, have used it for many workflows.
They are charging $15.00 an hour for an llm powered assistant. Like wtf, how do these people think that's a valid business model. This will 1000% annoy every customer that uses it. I hate this timeline so much.