It still surprises me when I see people not prompting more specifically and clearly. It not only avoids problems, it's faster, costs less, and just works better.
I recently shared with a friend a multi-hour LLM chat session I'd done because it veered into a domain he's interested in. In the session I'd brainstormed and probed the feasibility of a novel concept for a new research direction. It traversed a half dozen domains, diving into minute detail then zooming back out to survey an adjacent space, interspersed with intense skeptical probing of key assumptions, all while spewing tons of detailed citations, specific paragraph pulls, summarized data tables, etc.
My friend is very experienced using LLMs for research, so I was surprised when he called me, shocked by the sheer velocity, precise targeting and signal/noise. I'd assumed everyone did it the same as I do. He attributed the different result solely to the way I crafted my prompts.
This doesn’t always work better. But often enough.
I noticed this last year and started experimenting, which led to several realizations about how my prompt's tone, style, length, format, word choices and even punctuation can have a very counter-intuitive impact on model responses. It's not that one strategy always gets "better" results; they're just different in specific ways, which can make one input style better for one context but worse for another. I first noticed this effect when modding my user prompt so major topic headings would always be numbered. It's surprisingly difficult to get the model to reliably use the same simple scheme due to various potential ambiguities. So I spent a little time word-smithing, lawyering and tuning the prompt, but I found the closer I got to full compliance on heading numbering, the more unrelated things would drift. For example, it would just stop using bullets, even though I never mentioned anything about bullets.
Then I changed the prompt to "Change nothing about your default formatting, except headings." But just mentioning anything related to formatting could suddenly cause unintended effects on seemingly unrelated things. Then I tried being explicitly directive about all formatting to just lock it down. And this completely failed, because once the formatting was perfect, I started noticing the model's output would get less intelligent much earlier in sessions. So I cleared my user prompt entirely, as it wasn't worth the cognitive cost to the model or my time. A few days later, in a long session, I noticed it was numbering everything perfectly with no prompt at all. When I scrolled back through, I saw it didn't start out numbering its responses. It started doing it because I was consistently numbering every major concept in my inputs, even though I never mentioned numbering or formatting.
So... yeah, subtle differences in prompts, which absolutely shouldn't matter, do impact model output in unexpected ways. And, as of now, these effects can only be fully suppressed with strong directive prompts for short periods, but doing so always impacts other unrelated things and has some cognitive impact on model performance. So, by paying a little attention, I've discovered ways to optimize a model's output in the direction I need by shifting not only my prompt's explicit directives but also the subliminal meta-elements like tone, style, length, structure, formatting, etc.
LLMs gain so much knowledge and capability from absorbing the symbolic relationships embedded in human language, but in doing so they inevitably absorb many of the human foibles, sensitivities and weaknesses reflected in our languages.