Decision trees – the unreasonable power of nested decision rules

https://mlu-explain.github.io/decision-tree/

333mschnell | 13 hours ago | 61 | HN

A 'secret weapon' that has served me very well for learning classifiers is to first learn a good linear classifier. I am almost hesitant to give this away (kidding).

Use the non-thresholded version of that linear classifier output as one additional feature-dimension over which you learn a decision tree. Then wrap this whole thing up as a system of boosted trees (that is, with more short trees added if needed).

One of the reasons why it works so well, is that it plays to their strengths:

(i) Decision trees have a hard time fitting linear functions (they have to stair-step a lot, therefore need many internal nodes) and

(ii) linear functions are terrible where equi-label regions have a recursively partitioned structure.

In the decision tree building process the first cut would usually be on the synthetic linear feature added, which would earn it the linear classifier accuracy right away, leaving the DT algorithm to work on the part where the linear classifier is struggling. This idea is not that different from boosting.

One could also consider different (random) rotations of the data to form a forest of trees build using steps above, but was usually not necessary. Or rotate the axes so that all are orthogonal to the linear classifier learned.

One place were DT struggle is when the features themselves are very (column) sparse, not many places to place the cut.

ekjhgkejhgk7 hours ago | parent | next

> (ii) linear functions are terrible where equi-label regions have a partitioned structure.

Could you explain what "equi-label regions having a partitioned structure" mean?

srean7 hours ago | root | parent

I missed a word "recursively", that I have edited in my original comment now.

Consider connected regions in the domain that have the same label. Much like countries on a political map. The situation where this has a short description in terms of recursive subdivision of space, is what I am calling a partitioned structure. It's really rather tautological.

loading story #47207051

lokimedes9 hours ago | parent | next

When I worked at CERN around 2010, Boosted Decision Trees were the most popular classifier, exactly due to the (potential for) explainability along with its power of expression. We had a cultural aversion for neural networks back then, especially if the model was used in physics analysis directly. Times have changed…

srean8 hours ago | parent | next

> Times have changed…

This makes me a little concerned -- the use of parameters rich opaque models in Physics.

Ptolemaic system achieved a far better fit of planetary motion (over the Copernican system) because his was a universal approximator. Epicyclic system is a form of Fourier analysis and hence can fit any smooth periodic motion. But the epicycles were not the right thing to use to work out the causal mechanics, in spite of being a better fit empirically.

In Physics we would want to do more than accurate curve fitting.

loading story #47206832

wodenokoto9 hours ago | parent

Are boosted decision trees the same as a boosted random forest?

hansvm8 hours ago | root | parent | next

The training process differs, but the resulting model only differs in data rather than code -- you evaluate a bunch of trees and add them up.

For better or for worse (usually for better), boosted decision trees work harder to optimize the tree structure for a given problem. Random forests rely on enough trees being good enough.

Ignoring tree split selection, one technique people sometimes do makes the two techniques more related -- in gradient boosting, once the splits are chosen it's a sparse linear algebra problem to optimize the weights/leaves (iterative if your error is not MSE). That step would unify some part of the training between the two model types.

boccaff9 hours ago | root | parent

short answer: No.

longer answer: Random forests use the average of multiple trees that are trained in a way to reduce the correlation between trees (bagging with modified trees). Boosting trains sequentially, with each classifier working on the resulting residuals so far.

I am assuming that you meant boosted decision trees, sometimes gradient boosted decisions trees, as usually one have boosted decision trees. I think xgboost added boosted RF, and you can boost any supervised model, but it is not usual.

huqedato7 hours ago | parent | next

Random forests on the same site: https://mlu-explain.github.io/random-forest/

fooker10 hours ago | parent | next

Fun fact - single bit neural networks are decision trees.

In theory, this means you can 'compile' most neural networks into chains of if-else statements but it's not well understood when this sort of approach works well.

smokel7 hours ago | parent | next

> single bit neural networks are decision trees.

I didn't exactly understood what was meant here, so I went out and read a little. There is an interesting paper called "Neural Networks are Decision Trees" [1]. Thing is, this does not imply a nice mapping of neural networks onto decision trees. The trees that correspond to the neural networks are huge. And I get the idea that the paper is stretching the concept of decision trees a bit.

Also, I still don't know exactly what you mean, so would you care to elaborate a bit? :)

[1] https://arxiv.org/pdf/2210.05189

lioeters7 hours ago | root | parent

Closest thing I found was:

Single Bit Neural Nets Did Not Work - https://fpga.mit.edu/videos/2023/team04/report.pdf

> We originally planned to make and train a neural network with single bit activations, weights, and gradients, but unfortunately the neural network did not train very well. We were left with a peculiar looking CPU that we tried adapting to mine bitcoin and run Brainfuck.

Almondsetat10 hours ago | parent | next

Do you know of any software that does this? Or any papers on the matter? It could be a fun weekend project

tomashubelbauer10 hours ago | root | parent

Made me think of https://github.com/xoreaxeaxeax/movfuscator. Would be definitely cool to see it realized even if it would be incredibly impractical (probably).

10 hours ago | parent

{"deleted":true,"id":47205812,"parent":47205668,"time":1772364872,"type":"comment"}

zelphirkalt10 hours ago | parent | next

Decision trees are great. My favorite classical machine learning algorithm or group of algorithms, as there are many slight variations of decision trees. I wrote a purely functional (kind of naive) parallelized implementation in GNU Guile: https://codeberg.org/ZelphirKaltstahl/guile-ml/src/commit/25...

Why "naive"? Because there is no such thing as NumPy or data frames in the Guile ecosystem to my knowledge, and the data representation is therefore probably quite inefficient.

srean9 hours ago | parent

What benefit does numpy or dataframes bring to decision tree logic over what is available in Guile already ? Honest question.

Guile like languages are very well suited for decision trees, because manipulating and operating on trees is it's mother tongue. Only thing that would be a bit more work would be to compile the decision tree into machine code. Then one doesn't have traverse a runtime structure, the former being more efficient.

BTW take a look at Lush, you might like it.

https://lush.sourceforge.net/

https://news.ycombinator.com/item?id=2406325

If you are looking for vectors and tensors in Guile, there is this

https://wedesoft.github.io/aiscm/

zelphirkalt8 hours ago | root | parent | next

I think data frames are quite memory efficient and can store non-uniform data types (as can vectors in Guile). Generally, a ton of work has gone into making operations on data frames fast. I don't think a normal vector or multi-dimensional array can easily compete. Data frames are probably also compiled to some quite efficient machine code. Not sure whether Guile's native data structures can match that. Maybe they can.

Also I think I did not optimize for memory usage, and my implementation might keep copies of subsets of data points for each branch. I was mostly focused on the algorithm, not that much on data representation.

Another point, that is not really efficiency related, is that data frames come with lots of functionality to handle non-numeric data. If I recall correctly, they have functionality like doing one-hot encoding and such things. My implementation simply assumes all you have is numbers.

There might also be efficiency left on the table in my implementation, because I use the native number types of Guile, which allow for arbitrarily large integers (which one might not need in many cases) and I might even have used fractions, instead of inexact floats.

I guess though, with good, suitable data structures and a bit of reworking the implementation, one could get a production ready thing out of my naive implementation, that is even trivially parallelized and still would have the linear speedup (within some bounds only, probably, because decision trees usually shouldn't be too deep, to avoid overfitting) that my purely functional implementation enables.

Thanks for the links!

srean8 hours ago | root | parent

For linear algebraic transformation applied to several rows at once, I wholeheartedly agree.

Not so convinced about decision trees though (that process one row at a time).

Yeah, unless you had to deal with arbitrarily large integer features, Guile integers would come with a big efficiency hit.

boccaff9 hours ago | root | parent

tree algorithms on sklearn use parallel arrays to represent the tree structure.

loading story #47208604

kqr10 hours ago | parent | next

Experts' nebulous decision making can often be modelled with simple decision trees and even decision chains (linked lists). Even when the expert thinks their decision making is more complex, a simple decision tree better models the expert's decision than the rules proposed by the experts themselves.

I've long dismissed decision trees because they seem so ham-fisted compared to regression and distance-based clustering techniques but decision trees are undoubtedly very effective.

See more in chapter seven of the Oxford Handbook of Expertise. It's fascinating!

ablob9 hours ago | parent

I once saw a visualization that basically partitioned decisions on a 2D plane. From that perspective, decision trees might just be a fancy word for kD-Trees partitioning the possibility space and attaching an action to the volumes.

Given that assumption, the nebulous decision making could stem from expert's decisions being more nuanced in the granularity of the surface separating 2 distinct actions. It might be a rough technique, but nonetheless it should be able to lead to some pretty good approximations.

srean9 hours ago | root | parent

You have this thing a little backwards that it is unintentionally hilarious.

Decision trees predate KD trees by a decade.

Both use recursive partitioning of function domain a fundamental and an old idea.

loading story #47208484

loading story #47207299

xmprt12 hours ago | parent | next

Interesting website and great presentation. My only note is that the color contrast of some of the text makes it hard to read.

thesnide11 hours ago | parent

exactly my thought. and here thr reader view of FF is a godsend.

having 'accessible' content is not only for people with disabilities, it also help with bad color taste.

well, at least bad taste for readable content ;)

loading story #47209885

loading story #47208836

moi238810 hours ago | parent | next

That was beautifully presented!

EGreg7 hours ago | parent | next

Isn’t that exactly how humans (and even animals) operate?

Human societies look for actual major correlations and establish classifications. Except with scientific-minded humans, we often also want, to know the why behind the correlations. David Hume got involved w that… https://brainly.com/question/50372476

Let me ask a provocative question. What, ultimately, is the difference between knowledge and bias?

Jaxon_Varr12 hours ago | parent | next

[dead]

loading story #47209942

#visit	12,936,736
#session	74,665
#live-session	0