
Softmax forever, or why I like softmax

https://kyunghyuncho.me/softmax-forever-or-why-i-like-softmax/
An aside: please use proper capitalization. With this article I found myself backtracking, thinking I'd missed a word, which was very annoying. I'm not sure what the author's intention was with that decision, but please reconsider.
I agree.

I'm all for Graham's pyramid of disagreement: we should focus on the core argument, rather than superfluous things like tone, or character, or capitalisation.

But this is too much for me personally. I just realised I consider the complete lack of capitalisation on a piece of public intellectual work to be obnoxious. Sorry, it's impractical, distracting and generates unnecessary cognitive load for everyone else.

You're the top comment right now, and it's not about the content of the article at all, which is a real shame. All the wasted thought cycles across so many people :(

Graham's Hierarchy, in "How to Disagree": https://paulgraham.com/disagree.html
Yeah, people should wake up to what people are really saying.
It's a fad associated with AI, popularised by Sam Altman especially.

It's the new black turtleneck that everyone is wearing, but will swear upon their mother's life isn't because they're copying Steve Jobs.

Well at least it makes it easy to know who to avoid
That is so incredibly dumb. I can get it in a tweet, but please, please, please properly capitalize in anything longer than a few words!
i don't want to press the shift key every time i need a capitalized letter on my phone, and i disable auto-correct because it constantly messes with my native languages, etc.

wasn't aware that this makes me a steve jobs copier :(

EDIT: people are seriously so emotionally invested in capitalization that i get downvoted into minus, jeez.

When you consciously choose to save yourself effort in writing, at the expense of the readers who are trying to make sense of what you are saying, the people onto whom you've transferred the cognitive load are not likely to appreciate your laziness.
your comment contains one, single, capitalized letter. if the first W in your comment had been lowercase, would that have made your comment so much harder to read?

does it make my comment so hard to read just because i don't start my sentences with capital letters and don't capitalize myself (i)? i really don't get the fuss.

of course i capitalize letters in "official" texts, but we're in a comment section.

i find it doubly funny because english doesn't capitalize lots of things anyway.

> EDIT: people are seriously so emotionally invested in capitalization that i get downvoted into minus, jeez.

I find it weird that you would be surprised that people care about the quality of textual communication.

> It's a fad associated with AI, popularised by Sam Altman especially.

I know this is true, but does anyone understand why they do it? It is actually cognitively disruptive when reading, because many of us are trained to proofread while we read.

So I also consider it a type of cognitive attack vector, and it annoys me immensely as well.

The sibling comment to yours mentions that this is pretty common on Twitter, and I'd guess that it started as a way to make firing off tweets from a phone easier (since the extra effort to hit shift when typing on a phone keyboard is a bit higher, and the additional effort to go back and fix any typos that happen due to trying to capitalize things is also higher compared to using a traditional keyboard). Once enough people were doing it there, the style probably became recognizable and evoked a certain "vibe" that people wanted to replicate elsewhere, including in places where the original context of "hitting the shift key is more work than it's worth" doesn't hold as well.
> since the extra effort to hit shift when typing on a phone keyboard is a bit higher, and the additional effort to go back and fix any typos that happen due to trying to capitalize things is also higher compared to using a traditional keyboard

I'm a bit confused about this. Do people turn off auto capitalisation on their phones? I very rarely have to press shift on my phone

I and everyone I know turn it off. On many platforms and in many cultures, capitalization often implies solemnness or even rudeness in 1-on-1 conversations, and otherwise comes across as out of touch in other kinds of communication.
Wow, then I guess everyone finds me very rude. I capitalize and use correct grammar and spelling to the best of my ability in text messages, just like in any written communication. I find it rude when people don't, as I interpret it to mean they don't care enough about our communication to make the small effort to render their writing easy to comprehend!
I’ve never encountered anyone turning it off. I avoid socialising with people who think it’s rude to use capitalisation.
It's not directly rude, it's more like a serious tone of voice. But it only works like that when used unnecessarily, like in chat or IM where the message boundary styling doubles as a sentence boundary.

Using the chat/IM style outside of that context just doesn't work and looks really odd, like it's obviously someone who didn't learn those norms and is now mimicking them without understanding them.

I only communicate seriously
https://www.theguardian.com/society/2025/feb/18/death-of-cap...
Well, I will fight this trend to the death. Thankfully I don't like to surround myself with philistines.
The war is already over.

I 100% agree lowercase in longform essays is ridiculous, but I think for everything aside from essays, articles, papers, long emails, and some percentage of multi-paragraph site comments, lowercase is absolutely going to be the default online in 20 years.

So "everything that matters will continue to be written normally, but throwaway chatter will be written casually, where the specific features connoting casualness are a matter of ever-changing fashion"? Thinking back on '90s-era IRC chats, I suppose it was ever thus.
> for everything aside from essays, articles, papers, long emails, and some percentage of multi-paragraph site comments

That’s already the only stuff worth reading and always has been. No loss, then.

This is the norm for Gen Z. We don’t see it because children don’t set social norms where adults are present too, but with the oldest of Gen Z about to turn 30, you and I should expect to see this more and more, and get used to it. If every kid can handle it, I think we can, too.
an opinion, and a falsifiable hypothesis:

call me old-fashioned, but two spaces after a period will solve this problem if people insist on all-lower-case. this also helps distinguish between abbreviations such as st. martin's and the ends of sentences.

i'll bet that the linguistics experimentalists have metrics that quantify reading speed, as determined by eye-tracking experiments, and can verify this.

( do away with both capitalization and periods ( use tabs to separate sentences ( problem solved [( i'm only kind of joking here ( i actually think that would work pretty well ))] )))

( or alternatively use nested sexp to delineate paragraphs, square brackets for parentheticals [( this turned out to be an utterly cursed idea, for the record )] )

> [I]'ll bet that the linguistics experimentalists have metrics that quantify reading speed, as determined by eye-tracking experiments, and can verify this.

You appear to be trolling for the sake of trolling, but for reference: reading speed is determined by familiarity with the style of the text. Diverging from whatever people are used to will make them slower.

There is no such thing as "two spaces" in HTML, so good luck with that.

> There is no such thing as "two spaces" in HTML, so good luck with that.

Code point 160 followed by 32. In other words `  ` will do it.

There's U+3000, ideographic space. It's conceptually fitting, with sentence separation being a good fit for "idea separation".

Edit: well, I tried to give an example, but HN seems to replace it with a regular space. Here's a copy-paste version: https://unicode-explorer.com/c/3000

Belying the name somewhat, I believe U+3000 is specifically meant for use with Sinoform logographs, having the size of a (fullwidth character) cell, and so it makes little sense in other contexts.
Language evolves. Capitalization is an artifact of a period when capitalizing the first letter made a lot of sense for the medium (parchment/paper). Modern culture is abandoning it for typing efficiency on physical and digital keyboards. A purist would say that we should still be using all capitals like they did in Greek/Latin, which again was related to the medium.

I'll likely continue using Capitalization as a preference, and because we use it to express conventions in programming, but I totally understand the movement to drop it, and frankly it's logical enough.

It's slower for sure, but capitalization does impart information: beginning of sentences, proper nouns, acronyms, and such. Sure, you could re-read the sentence until you figured all that out, but you are creating unnecessary hitches in the reading process. Capitalization is an optimization for the reader, and lack of capitalization is optimization for the writer.
As much as I dislike it sometimes, language absolutely does evolve. Dropping proper capitalization does not fit into this, though. Leaving something uncapitalized can completely change its meaning. It's not just the beginnings of sentences; it's proper nouns within a sentence. Unfortunately I don't have an example handy, but it's happened several times in my life that I've been completely confused by this (mostly on Slack).

This is merely showing off your personal style, which, in a technical article, I don't care about.

> we use it to express conventions in programming

Interestingly programming is the one place where I ditch it almost entirely (at least in my personal code bases).

There are many useful tricks - like cosine distance.

In contrast, softmax has a very deep grounding in statistical physics - where it is called the Boltzmann distribution. In fact, this connection between statistical physics and machine learning was so fundamental that it was a key part of the 2024 Nobel Prize in Physics awarded to Hopfield and Hinton.
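
Spelled out, the correspondence is exact (using the usual ML convention of temperature T = 1; E_i and Z here are the standard statistical-physics names, not anything from the article):

    softmax(z)_i = e^{z_i} / sum_j e^{z_j}
                 = e^{-E_i} / Z,    with E_i = -z_i and Z = sum_j e^{-E_j}

i.e., softmax over logits z is a Boltzmann distribution whose energies are the negated logits.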

The study of thermodynamics gave rise to many concepts in information theory and statistics, but I wouldn't say that there is any direct connection per se between thermodynamics and any field where statistics or information theory are applicable. And the reasoning behind the 2024 Nobel Prize in Physics was... quite innovative.
> I wouldn't say that there is any direct connection per se between thermodynamics and any field where statistics or information theory are applicable.

Thermodynamics can absolutely be studied through both a statistical mechanics and an information theory lens, and many physicists have found this to be quite productive and enlightening. Especially when it gets to tricky cases involving entropy, like Maxwell's Demon and Landauer's Eraser, one struggles not to do so.

How to sample from a categorical: https://news.ycombinator.com/item?id=42596716

Note: I am the author

I'm happy to see you repaired your keyboard.
I think they mean they're the author of the post they link, not the author of the OP with his broken caps.
Oh, right. I misunderstood.
I think that log-sum-exp should actually be the function that gets the name "softmax", because it's actually a soft maximum over a set of values. And what we call "softmax" should be called "grad softmax" (since the grad of logsumexp is softmax).
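
A quick numerical sanity check of that gradient identity (a minimal sketch assuming numpy; the function names are just illustrative):

    import numpy as np

    def logsumexp(z):
        m = z.max()                               # shift for numerical stability
        return m + np.log(np.exp(z - m).sum())

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    z = np.array([1.0, -2.0, 0.5])
    print(logsumexp(z))                           # ~1.50: a "soft" version of max(z) = 1.0

    # finite-difference gradient of logsumexp matches softmax(z)
    eps = 1e-6
    grad = np.array([(logsumexp(z + eps * np.eye(3)[i]) - logsumexp(z)) / eps
                     for i in range(3)])
    print(np.allclose(grad, softmax(z), atol=1e-4))   # True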
Softmax is badly named and should rather be called softargmax.
This is a really intuitive explanation, thanks for posting. I think everyone’s first intuition for “how do we turn these logits into probabilities” is to use a weighted sum of the absolute values of the numbers. The unjustified complexity of softmax annoyed me in college.

The author gives a really clean explanation for why that’s hard for a network to learn, starting from first principles.

Funny timing, I just used softmax yesterday to turn a list of numbers (some of which could be negative) into a probability distribution (summing up to 1). So useful. It was the perfect tool for the job.
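
For anyone reaching for the same tool: the standard trick is to subtract the max before exponentiating, which leaves the result unchanged (the shift cancels in the ratio) but avoids overflow for large entries. A minimal sketch assuming numpy (scipy.special.softmax does the same if scipy is already a dependency):

    import numpy as np

    def softmax(z):
        z = np.asarray(z, dtype=float)
        e = np.exp(z - z.max())        # shift by the max before exponentiating
        return e / e.sum()

    print(softmax([2.0, -3.0, 0.5]))   # non-negative, sums to 1, order preserved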
The author admits they "kinda stopped reading this paper" after noticing that it only used one hyperparameter configuration, which I agree is a flaw in the paper, but that's not an excuse for sloppy treatment of the rest of it. (It would, however, be an excuse to ignore it entirely.)

In particular, the assumption that |a_k| ≈ 0 initially is incorrect, since in the original paper https://arxiv.org/abs/2502.01628 the a_k are distances from one vector to multiple other vectors, and they're unlikely to be initialized in such a way that the distance is anywhere close to zero. So while the gradient divergence near 0 could certainly be a problem, it doesn't have to be as fatal as the author seems to think it is.

That would be "welcome to the world of academia". My post-doc friends won't even read a blog post before checking the author's resume, and they are very dismissive whenever they notice anything they consider sloppy.
Which is a problem with the reputation-based academic system itself ("publish or perish") and not individuals working in it.
{"deleted":true,"id":43111775,"parent":43066481,"time":1740033398,"type":"comment"}
Softmax’s exponential comes from counting occupation states. Maximize the ways to arrange things with logits as energies, and you get exp(logits) over a partition function, pure Boltzmann style. It’s optimal because it’s how probability naturally piles up.
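
Concretely, that's the maximum-entropy derivation: maximize H(p) = -sum_i p_i log p_i over distributions with a fixed average energy, and the Lagrange conditions force the Boltzmann form:

    L = -sum_i p_i log p_i - lambda (sum_i p_i - 1) - beta (sum_i p_i E_i - <E>)
    dL/dp_i = -log p_i - 1 - lambda - beta E_i = 0
      =>  p_i = e^{-beta E_i} / Z,    Z = sum_j e^{-beta E_j}

With E_i = -z_i (logits as negative energies) and beta = 1, this is exactly softmax over the logits.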
I personally don’t think much of the maximum entropy principle. If you look at the axioms that inform it, they don’t really seem obviously correct. Further, the usual qualitative argument is only right in a certain lens: namely they say choosing anything else would require you to make more assumptions about your distribution than is required. Yet it’s easy to find examples where the max entropy solution suppresses some states more than is necessary etc., which to me contradicts that qualitative argument.
Right, and it should be totally obvious that we would choose an energy function from statistical mechanics to train our hotdog-or-not classifier.
No need to introduce the concept of energy. It's a "natural" probability measure on any space where the outcomes have some weight. In particular, it's the measure that maximizes entropy while fixing the average weight. Of course it's contentious if this is really "natural," and what that even means. Some hardcore proponents like Jaynes argue along the lines of epistemic humility but for applications it really just boils down to it being a simple and effective choice.
In statistical mechanics, fixing the average weight has significance, since the average weight i.e. average energy determines the total energy of a large collection of identical systems, and hence is macroscopically observable.

But in machine learning, it has no significance at all. In particular, to fix the average weight, you need to vary the temperature depending on the individual weights, but machine learning practitioners typically fix the temperature instead, so that the average weight varies wildly.

So softmax weights (logits) are just one particular way to parameterize a categorical distribution, and there's nothing precluding another parameterization from working just as well or better.

I agree that the choice of softmax is arbitrary; but if I may be nitpicky, the average weight and the temperature determine one another (the average weight is the derivative of the log of the partition function with respect to the inverse temperature). I think the arbitrariness comes more from choosing logits as a weight in the first place.
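
For completeness, the identity being referenced, in the weights-as-logits convention p_i = e^{beta w_i} / Z with beta = 1/T:

    <w> = sum_i w_i e^{beta w_i} / Z = d(log Z)/d(beta)
    d<w>/d(beta) = Var(w) >= 0

so the average weight is monotone in beta (strictly, unless all weights are equal), which is why fixing one pins down the other.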
Off topic: Unlike many out there I'm not usually bothered by lack of capitalization in comments or tweets, but for an essay like this, it makes the paragraphs so hard to read!
If someone can't even put in a minimal amount of effort for basic punctuation and grammar, I'm not going to read their article on something more sophisticated. If you go for the lowercase i's because you want a childish or slob aesthetic, that can be funny in context. But in math or computing, I'm not going to care what someone thinks if they don't know or don't care that 'I' should be capitalized. Grammarly has a free tier. ChatGPT has a free tier. Paste your word slop into one of those and it will fix the basics for you.
We just had a similar discussion at work the other day when one of our junior engineers noticed that a senior engineer was reflexively tapping the space bar twice after each sentence. That, too, was good style back when we were writing on typewriters or using monospace fonts with no typesetting. Only a child or a slob would fail to provide an extra gap between sentences, it would be distracting to readers and difficult to locate full stops without that!

But it's 2025, and HTML and Word and the APA and MLA and basically everyone agree that times and style guides have changed.

I agree that not capitalizing the first letter in a sentence is a step too far.

For a counter-example, I personally don't care whether they use the proper em-dash, en-dash, or hyphen--I don't even know when or how to insert the right one with my keyboard. I'm sure there are enthusiasts who care very deeply about using the right ones, and feel that my lack of concern for using the right dash is lazy and unrefined. Culture is changing as more and more communication happens on phone touchscreens, and I have to ask myself - am I out of touch? No, it's the children who are wrong. /s

But I strongly disagree that the author should pass everything they write through Grammarly or worse, through ChatGPT.

Same here, I just had to stop reading.
OT: refusing to capitalize the first word of each sentence is an annoying posture that makes reading what you write more difficult. I tend to do it too when taking notes for myself, because I'm the only reader and it saves picoseconds of typing; but I wouldn't dream of inflicting it upon others.
It shows how laid back they are, man.

/s

{"deleted":true,"id":43113218,"parent":43112829,"time":1740048268,"type":"comment"}
The author is trying to show off; you can tell because his explanation makes no sense and is overcomplicated to look smart.