Decision trees – the unreasonable power of nested decision rules


An introduction to Decision Trees, Entropy, and Information Gain.

The unreasonable power of nested decision rules.

By Jared Wilber & Lucía Santamaría

Let's pretend we're farmers with a new plot of land. Given only the Diameter and Height of a tree trunk, we must determine if it's an Apple, Cherry, or Oak tree. To do this, we'll use a Decision Tree.

Almost every tree with a Diameter ≥ 0.45 is an Oak tree! Thus, we can probably assume that any other trees we find in that region will also be one.

This first decision node will act as our root node. We'll draw a vertical line at this Diameter and classify everything above it as Oak (our first leaf node), and continue to partition our remaining data on the left.

We continue along, hoping to split our plot of land in the most favorable manner. We see that creating a new decision node at Height ≤ 4.88 leads to a nice section of Cherry trees, so we partition our data there.

Our Decision Tree updates accordingly, adding a new leaf node for Cherry.

After this second split we're left with an area containing many Apple and some Cherry trees. No problem: a vertical division can be drawn to separate the Apple trees a bit better.

Once again, our Decision Tree updates accordingly.

The remaining region just needs a further horizontal division and boom - our job is done! We've obtained an optimal set of nested decisions. That said, some regions still enclose a few misclassified points. Should we continue splitting, partitioning into smaller sections?

Hmm...

If we do, the resulting regions would start becoming increasingly complex, and our tree would become unreasonably deep. Such a Decision Tree would learn too much from the noise of the training examples and not enough generalizable rules.

Does this ring familiar? It is the well-known tradeoff that we explored in our explainer on The Bias Variance Tradeoff! In this case, going too deep results in a tree that overfits our data, so we'll stop here.

We're done! We can simply pass any new data point's Height and Diameter values through the newly created Decision Tree to classify it as either an Apple, Cherry, or Oak tree!

Decision Trees are supervised machine learning algorithms used for both regression and classification problems. They're popular for their ease of interpretation and large range of applications. Decision Trees consist of a series of decision nodes on some dataset's features, and make predictions at leaf nodes.

Scroll on to learn more!

Decision Trees are widely used algorithms for supervised machine learning. They're popular for their ease of interpretation and large range of applications. They work for both regression and classification problems.

A Decision Tree consists of a series of sequential decisions, or decision nodes, on some data set's features. The resulting flow-like structure is navigated via conditional control statements, or if-then rules, which split each decision node into two or more subnodes. Leaf nodes, also known as terminal nodes, represent prediction outputs for the model.

To train a Decision Tree from data means to figure out the order in which the decisions should be assembled from the root to the leaves. New data may then be passed from the top down until reaching a leaf node, representing a prediction for that data point.
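Passing a data point from the top down amounts to walking a chain of nested if-then rules. A minimal sketch of the tree built in the example above (the first two cutoffs come from the article; the 0.30 cutoff is a hypothetical value added for illustration):

```python
def classify(diameter, height):
    """Walk the example tree from root to a leaf for one data point.
    The 0.45 and 4.88 cutoffs are from the article; 0.30 is illustrative."""
    if diameter >= 0.45:      # root node: wide trunks are Oaks
        return "Oak"
    if height <= 4.88:        # second decision node: short trees are Cherries
        return "Cherry"
    if diameter >= 0.30:      # further vertical split (hypothetical cutoff)
        return "Apple"
    return "Cherry"
```

Each call traces one root-to-leaf path; the returned label is the prediction at that leaf.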

We just saw how a Decision Tree operates at a high-level: from the top down, it creates a series of sequential rules that split the data into well-separated regions for classification. But given the large number of potential options, how exactly does the algorithm determine where to partition the data? Before we learn how that works, we need to understand Entropy.

Entropy measures the amount of information of some variable or event. We'll make use of it to identify regions consisting of a large number of similar (pure) or dissimilar (impure) elements.

Given a certain set of events that occur with probabilities $p_1, p_2, \ldots, p_n$, the total entropy $H$ can be written as the negative sum of weighted probabilities:

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$

The quantity $H$ has a number of interesting properties:

Entropy Properties

  1. $H = 0$ only if all but one of the $p_i$ are zero, this one having the value 1. Thus the entropy vanishes only when there is no uncertainty in the outcome, meaning that the sample is completely unsurprising.
  2. $H$ is maximum when all the $p_i$ are equal ($p_i = 1/n$). This is the most uncertain, or 'impure', situation.
  3. Any change towards the equalization of the probabilities $p_i$ increases $H$.

The entropy can be used to quantify the impurity of a collection of labeled data points: a node containing multiple classes is impure whereas a node including only one class is pure.

Above, you can compute the entropy of a collection of labeled data points belonging to two classes, which is typical for binary classification problems. Click on the Add and Remove buttons to modify the composition of the bubble.

Did you notice that pure samples have zero entropy whereas impure ones have larger entropy values? This is what entropy is doing for us: measuring how pure (or impure) a set of samples is. We'll use it in the algorithm to train Decision Trees by defining the Information Gain.
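As a minimal sketch, entropy can be computed directly from a collection of labels (pure Python, base-2 logarithm, so the result is in bits):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)
```

A pure sample gives 0 bits, while an even two-class mix gives the binary maximum of 1 bit.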

With the intuition gained from the animation above, we can now describe the logic used to train Decision Trees. As the name implies, information gain measures the amount of information that we gain, and it does so using entropy. The idea is to subtract from the entropy of our data before the split the entropy of each possible partition thereafter. We then select the split that yields the largest reduction in entropy, or equivalently, the largest increase in information.

The core algorithm that uses information gain to build Decision Trees is called ID3. It's a recursive procedure that starts from the root node of the tree and iterates top-down on all non-leaf branches in a greedy manner, calculating at each depth the difference in entropy:

$$IG = H(\text{parent}) - \sum_{j} \frac{n_j}{n} H(\text{child}_j)$$

To be specific, the algorithm's steps are as follows:

ID3 Algorithm Steps

  1. Calculate the entropy associated to every feature of the data set.
  2. Partition the data set into subsets using different features and cutoff values. For each, compute the information gain as the difference in entropy before and after the split using the formula above. For the total entropy of all children nodes after the split, use the weighted average with weights $n_j/n$, i.e. the fraction of samples that end up on each child branch.
  3. Identify the partition that leads to the maximum information gain. Create a decision node on that feature and split value.
  4. When no further splits can be done on a subset, create a leaf node and label it with the most common class of the data points within it if doing classification or with the average value if doing regression.
  5. Recurse on all subsets. Recursion stops if after a split all elements in a child node are of the same type. Additional stopping conditions may be imposed, such as requiring a minimum number of samples per leaf to continue splitting, or finishing when the trained tree has reached a given maximum depth.
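The split-selection core of these steps can be sketched in a few lines of Python: a greedy search over one numeric feature's cutoffs, not the full recursive ID3 (entropy is repeated here so the sketch is self-contained):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def best_split(xs, ys):
    """Steps 2-3 of ID3, simplified to one numeric feature: try every
    cutoff and keep the one with maximum information gain."""
    n = len(ys)
    best_cut, best_gain = None, -1.0
    for cut in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        if not left or not right:
            continue
        # Information gain: parent entropy minus weighted child entropies
        gain = entropy(ys) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain
```

The full algorithm runs this search over every feature, creates a decision node on the winner, and recurses on the resulting subsets.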

Of course, reading the steps of an algorithm isn't always the most intuitive thing. To make things easier to understand, let's revisit how information gain was used to determine the first decision node in our tree.

Recall our first decision node split on Diameter ≤ 0.45. How did we choose this condition? It was the result of maximizing information gain.

Each of the possible splits of the data on its two features (Diameter and Height) and cutoff values yields a different value of the information gain.

The line chart displays the different split values for the Diameter feature. Move the decision boundary yourself to see how the data points in the top chart are assigned to the left or right children nodes accordingly. On the bottom you can see the corresponding entropy values of both children nodes as well as the total information gain.

The ID3 algorithm will select the split point with the largest information gain, shown in the bottom chart as the peak of the black line: 0.574 at Diameter = 0.45.


An alternative to the entropy for the construction of Decision Trees is the Gini impurity. This quantity is also a measure of information and can be seen as a variation of Shannon's entropy. Decision trees trained using entropy or Gini impurity are comparable, and only in a few cases do results differ considerably. In the case of imbalanced data sets, entropy might be more prudent. Yet Gini might train faster as it does not make use of logarithms.
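As a sketch, Gini impurity for a collection of labels looks like this (note that, unlike entropy, no logarithms are involved):

```python
def gini(labels):
    """Gini impurity: the probability that two samples drawn at random
    (with replacement) from the collection carry different labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))
```

A pure sample gives 0; an even two-class mix gives the binary maximum of 0.5 (versus entropy's 1 bit), but both measures rank candidate splits very similarly.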

Let's recap what we've learned so far. First, we saw how a Decision Tree classifies data by repeatedly partitioning the feature space into regions according to some conditional series of rules. Second, we learned about entropy, a popular metric used to measure the purity (or lack thereof) of a given sample of data. Third, we learned how Decision Trees use entropy in information gain and the ID3 algorithm to determine the exact conditional series of rules to select. Taken together, the three sections detail the typical Decision Tree algorithm. To reinforce concepts, let's look at our Decision Tree from a slightly different perspective.

The tree below maps exactly to the tree we showed in the How to Build a Decision Tree section above. However, instead of showing the partitioned feature space alongside our tree's structure, let's look at the partitioned data points and their corresponding entropy at each node itself:


From the top down, our sample of data points to classify shrinks as it gets partitioned to different decision and leaf nodes. In this manner, we could trace the full path taken by a training data point if we so desired. Note also that not every leaf node is pure: as discussed previously (and in the next section), we don't want the structure of our Decision Trees to be too deep, as such a model likely won't generalize well to unseen data.

Without question, Decision Trees have a lot of things going for them. They're simple models that are easy to interpret. They're fast to train and require minimal data preprocessing. And they handle outliers with ease. Yet they suffer from a major limitation: their instability compared with other predictors. They can be extremely sensitive to small perturbations in the data: a minor change in the training examples can result in a drastic change in the structure of the Decision Tree.

Check for yourself how small random Gaussian perturbations on just 5% of the training examples create a set of completely different Decision Trees:

In their vanilla form, Decision Trees are unstable. If left unchecked, the ID3 algorithm to train Decision Trees will work endlessly to minimize entropy. It will continue splitting the data until all leaf nodes are completely pure - that is, consisting of only one class. Such a process may yield very deep and complex Decision Trees. In addition, we just saw that Decision Trees are subject to high variance when exposed to small perturbations of the training data.

Both issues are undesirable, as they lead to predictors that fail to clearly distinguish between persistent and random patterns in the data, a problem known as overfitting. This is problematic because it means that our model won't perform well when exposed to new data.

There are ways to prevent excessive growth of Decision Trees by pruning them: for instance, constraining their maximum depth, limiting the number of leaves that can be created, or requiring a minimum number of samples in each leaf before allowing further splits.
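These pruning constraints amount to extra stopping conditions inside the recursive tree builder. A sketch (the parameter names and default values here are illustrative, not from any particular library):

```python
def should_stop(labels, depth, max_depth=3, min_samples_leaf=2):
    """Pruning-style stopping rules: stop splitting a node if it is
    already pure, too deep, or too small to split meaningfully."""
    if len(set(labels)) == 1:               # all one class: nothing to gain
        return True
    if depth >= max_depth:                  # maximum-depth constraint
        return True
    if len(labels) < 2 * min_samples_leaf:  # a split couldn't give both
        return True                         # children enough samples
    return False
```

When this returns True, the builder creates a leaf labeled with the node's majority class instead of recursing further.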

As for the issue of high variance? Well, unfortunately it's an intrinsic characteristic when training a single Decision Tree.

Perhaps ironically, one way to alleviate the instability induced by perturbations is to introduce an extra layer of randomness in the training process. In practice this can be achieved by creating collections of Decision Trees trained on slightly different versions of the data set, the combined predictions of which do not suffer so heavily from high variance. This approach opens the door to one of the most successful Machine Learning algorithms thus far: random forests.

Stay tuned for our future article!

Thanks for reading! We hope that the article is insightful no matter where you are along your Machine Learning journey, and that you came away with a better understanding of the Decision Tree algorithm. To make things compact, we skipped over some relevant topics, such as using Decision Trees for regression, end-cut preference in tree models, and other tree-specific hyperparameters. Check out the resources listed below to learn more about those topics.

To learn more about Machine Learning, check out our self-paced courses, our YouTube videos, and the Dive into Deep Learning textbook. If you have any comments or ideas related to MLU-Explain articles, feel free to reach out directly to Jared or Lucía. The code for this article is available here.

A special thanks goes out to Brent Werness for valuable contributions to this article.

This article is a product of the following resources + the awesome people who made (& contributed to) them:

A Mathematical Theory Of Communication
(Claude E. Shannon, 1948).

Induction of decision trees
(John Ross Quinlan, 1986).

A Study on End-Cut Preference in Least Squares Regression Trees
(Luis Torgo, 2001).

The Origins Of The Gini Index: Extracts From Variabilità e Mutabilità (Corrado Gini, 1912)
(Lidia Ceriani & Paolo Verne, 2012).

D3.js
(Mike Bostock & Philippe Rivière)

d3-annotation
(Susie Lu)

KaTeX
(Emily Eisenberg & Sophie Alpert)



Comments

  • By srean 2026-03-0113:115 reply

    A 'secret weapon' that has served me very well for learning classifiers is to first learn a good linear classifier. I am almost hesitant to give this away (kidding).

    Use the non-thresholded version of that linear classifier output as one additional feature-dimension over which you learn a decision tree. Then wrap this whole thing up as a system of boosted trees (that is, with more short trees added if needed).

    One of the reasons why it works so well, is that it plays to their strengths:

    (i) Decision trees have a hard time fitting linear functions (they have to stair-step a lot, therefore need many internal nodes) and

    (ii) linear functions are terrible where equi-label regions have a recursively partitioned structure.

    In the decision tree building process the first cut would usually be on the synthetic linear feature added, which would earn it the linear classifier accuracy right away, leaving the DT algorithm to work on the part where the linear classifier is struggling. This idea is not that different from boosting.

    One could also consider different (random) rotations of the data to form a forest of trees built using the steps above, but this was usually not necessary. Or rotate the axes so that all are orthogonal to the linear classifier learned.

    One place where DTs struggle is when the features themselves are very (column) sparse: not many places to place the cut.
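A minimal sketch of the trick this comment describes: append the non-thresholded linear score as an extra feature column before growing the tree (the weights passed in would come from a fitted linear classifier; here they are hypothetical stand-ins):

```python
def add_linear_feature(X, w, b):
    """Extend each row of X with the raw linear score w . x + b,
    so a subsequently trained tree can make its first cut on it."""
    return [row + [sum(wi * xi for wi, xi in zip(w, row)) + b]
            for row in X]
```

A tree trained on the extended rows typically splits on the new column first, inheriting the linear classifier's accuracy and spending its remaining depth where the linear model struggles.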

    • By whatever1 2026-03-0118:511 reply

      If you think about it, this is what reinforcement learning folks are doing. They use the vanilla state and then lift it to the observed state by doing some additional calculation with the original state data.

      For example you start with the raw coordinates of snake in a snake game, but you now can calculate how many escape routes the snake has, and train on it.

      • By AndrewKemendo 2026-03-0119:41

        That’s correct

        I’ve spent most of the last 20 years doing reinforcement learning and it is exceptionally simple conceptually

        The challenge is data acquisition and the right types of process frameworks

    • By 3abiton 2026-03-0116:523 reply

      I think it's worth mentioning that the Achilles heel of DTs is, in fact, data (more specifically, feature) engineering. If one does not spend significant time cleaning and engineering the features, the results will be much worse than those of, say, a "black box" model like a NN. This is the catch. Ironically, a NN can detect such latent features, but it is very difficult to interpret why.

      • By datsci_est_2015 2026-03-0120:451 reply

        This varies so wildly from domain to domain. Highly structured data (time series, photos, audio, etc.) typically has a metric boatload of feature extraction methodology. Neural networks often draw on and exploit that structure (i.e. convolutions). You could even get some pretty good results on manually-extracted neural-network-esque features handed off to a random forest. This heuristic begins to fall off with deep learning though, which imo, is a precursor to LLMs and showed that emergent complexity is possible with machine learning.

        But non-structured data? Pretty pointless to hand off to a neural network imo.

        • By srean 2026-03-0121:041 reply

          Ah! Your comment helped me understand the parent comment so much more. I thought it was more about data hygiene needs.

          Yes a DT on raw pixel values, or a DT on raw time values will in general be quite terrible.

          That said the convolutional structure is hard coded in those neural nets, only the weights are learned. It is not that the network discovered on its own that convolutions are a good idea. So NNs too really (damn autocorrect, it's rely, rely) on human insight and structuring upon which they can then build over.

          • By thesz 2026-03-0222:45

              > So NNs too really (damn autocorrect, it's rely, rely) on human insight and structuring upon which they can then build over.
            
            If so, one can repeat it with other basic models as well.

            It is possible to train deep decision tree forests [1]. Then I believe that it is possible to train deep convolutional decision tree forests.

            [1] https://arxiv.org/pdf/1702.08835

      • By srean 2026-03-0117:07

        That has not been my experience though, apart from the need of standard data hygiene that one has to maintain for any ML exercise.

        Normalising the numeric features to a common range has been adequate. This too is strictly not necessary for DTs, but pretty much mandatory for linear models. (DTs are very robust to scale differences among features, linear models quite vulnerable to the same.)

        One can think of each tree path from root to leaf as a data driven formulation/synthesis of a higher level feature built out of logical conjunctions ('AND' operation).

        These auto-synthesized / discovered features are then ORed at the top. DTs are good at capturing multi-feature interactions that single layer linear models can't.

        NNs certainly synthesize higher level features, but what does not get highlighted enough is that the learning-theory motivated Adaboost algorithm and its derivatives do that too.

        Their modus operandi is "BYOWC, bring your own weak classifier, I will reweigh the data in such a way that your unmodified classifier will discover a higher level feature on its own. Later you can combine these higher features linearly, by voting, or by averaging".

        I personally favor differentiable models over trees, but have to give credit where it's due, DTs work great.

        What leaves me intellectually unsatisfied about decision trees is that the space of trees cannot be searched in any nice principled way.

        Or to describe in terms of feature discovery, in DT, there's no notion of progressing smoothly towards better high level features that track the end goal at every step of this synthesis (in fact greedy hill climbing hurts performance in the case of decision trees). DTs use a combinatorial search over the feature space partitioning, essentially by trial and error.

        Neural nets have a smooth way of devising these higher level features informed by the end goal. Roll infinitesimally in the direction of steepest progress -- that's so much more satisfying than trial and error.

        As for classification performance where DT have struggled are cases where columns are sparse (low entropy to begin with).

        Another weakness of DTs is the difficulty of achieving high runtime throughput unless the DT is compiled into machine code. Walking a runtime tree structure with not-so-predictable branching probabilities doesn't run very fast on our standard machines. Compared to DNNs, though, this is a laughable complaint: DT throughput is an order of magnitude or two faster.

      • By thesz 2026-03-0222:32

          > Ironically, NN can detect such latent features
        
        Like they detect magazine logo on images of horses?

    • By ekjhgkejhgk 2026-03-0113:581 reply

      > (ii) linear functions are terrible where equi-label regions have a partitioned structure.

      Could you explain what "equi-label regions having a partitioned structure" mean?

      • By srean 2026-03-0114:061 reply

        I missed a word "recursively", that I have edited in my original comment now.

        Consider connected regions in the domain that have the same label. Much like countries on a political map. The situation where this has a short description in terms of recursive subdivision of space, is what I am calling a partitioned structure. It's really rather tautological.

        • By ekjhgkejhgk 2026-03-0115:291 reply

          "recursively partitioned" sounds like a fractal to me. Not sure what you really mean.

          • By srean 2026-03-0115:33

            Taken to the limit you are absolutely right.

            It turns out many dataset have such a fractal like nature but where the partitioning needs to be cut off at a certain depth and not continued till infinity.

    • By u1hcw9nx 2026-03-0114:321 reply

      There are decision trees for what you want to do.

      Oblique Decision Trees, Model Trees (M5 Trees, for example), Logistic Model Trees (LMT), or Hierarchical Mixtures of Experts (HME).

      • By srean 2026-03-0114:37

        Yes.

        I mention restricted oblique trees in passing in my original comment. In my experience, oblique trees tend to add considerable complexity, the others more so. Of course whether the complexity is warranted or not will depend on the dataset.

        The merit of what I used is in its simplicity. Any simple ML library would have a linear classifier and a tree learner.

        Super easy to implement, train, maintain, debug. One to two person team can handle this fine.

    • By selimthegrim 2026-03-0119:341 reply

      Is this not the heart of the IRM paper by Arjovsky, Bottou et al.?

      • By srean 2026-03-0119:56

        Oh theirs is a far more sophisticated and deeper idea, to look for invariant representations.

        Mine is a quick but effective practical hack fuelled by a little bit of insight.

  • By lokimedes 2026-03-0112:203 reply

    When I worked at CERN around 2010, Boosted Decision Trees were the most popular classifier, exactly due to the (potential for) explainability along with its power of expression. We had a cultural aversion for neural networks back then, especially if the model was used in physics analysis directly. Times have changed…

    • By ekjhgkejhgk 2026-03-0117:121 reply

      I used to be in physics but theory, not experiment. I have experience at work with decision trees in a different field.

      I've always thought that the idea that decision trees are "explainable" is very overstated. The moment you go past a couple of levels in depth, it becomes an un-interpretable jungle. I've actually done the exercise of inspecting how a depth-15 decision tree makes decisions, and I found it impossible to interpret anything.

      In a neural network you can also follow the successive matrix multiplications and relu etc through the layers, but you end up not knowing how the decision is made.

      Thoughts?

      • By lokimedes 2026-03-0117:551 reply

        I completely agree, as you may infer from my comment. The second multivariate models are relevant we effectively trade explainability for discrimination power. If your decision tree/model needs to be large enough to warrant SGD or similar optimization techniques, it is pretty much a fantasy to ever analyze it formally.

        My second job after physics was AI for defense, and boy is the dream of explainable AI alive there.

        Honestly, anyone who "needs" AI to be understandable by dissection suffers from control issues :)

        • By Otterly99 2026-03-0210:50

          Do you have any good resource on explainable AI (XAI)? Most books I read on the subject tend to be a bit shallow and not very informative.

    • By srean 2026-03-0113:482 reply

      > Times have changed…

      This makes me a little concerned -- the use of parameter-rich opaque models in Physics.

      The Ptolemaic system achieved a far better fit of planetary motion (over the Copernican system) because it was a universal approximator. The epicyclic system is a form of Fourier analysis and hence can fit any smooth periodic motion. But the epicycles were not the right tool for working out the causal mechanics, in spite of being a better fit empirically.

      In Physics we would want to do more than accurate curve fitting.

      • By lokimedes 2026-03-0114:102 reply

        If you sum up experimental physics into one heuristic it is “avoid fooling yourself with assumptions” - I left physics over a decade ago, but I feel confident that physicists still work hard to understand what they observe and don’t let LLMs have all the fun. If there’s one field of science where the scientists are legitimately allowed to go all the way back to basics, it’s elementary particle physics.

        • By YeGoblynQueenne 2026-03-0212:56

          This guy says physicists now leave most of the work to LLMs:

          https://www.youtube.com/watch?v=PctlBxRh0p4

          (Original title was "Physicists are surrendering to AI" but it was changed since to something less clickbaity, which I appreciate. I posted it here on the original title: https://news.ycombinator.com/item?id=46859804)

        • By srean 2026-03-0114:241 reply

          In general I would agree. I think it holds true at the highest levels.

          What worries me is the noticeable uptick of presentations of the sort -- look ma, better fit ... deep neural nets. These are mostly by more junior folks, but not necessarily. I have been in the audience for many.

          Add to these the uptick in research proposals funded by providers of infrastructure for such DNNs.

          A charitable read could be that they just want the money and would do the principled thing.

          • By lokimedes 2026-03-0114:491 reply

            Again, said as someone out of the fray, let’s hope it self-corrects. Physics is a very community driven field, and the young must try new things, and be allowed to, it is part of progress. It is when the seniors surrender the standard of quality they carry, we have trouble. And here, indeed, particle physics can be uniquely vulnerable - given the complexity and economics of the research, it is hard to falsify claims made with new methods if the established researchers cave too easily.

            • By srean 2026-03-0115:27

              Amen.

              To support your point of view, I haven't encountered this in particle physics. It's from other branches. I am not a Physicist myself, happened to be in a position to observe funding requests, request for collaborations.

    • By wodenokoto 2026-03-0112:392 reply

      Are boosted decision trees the same as a boosted random forest?

      • By boccaff 2026-03-0112:49

        short answer: No.

        longer answer: Random forests use the average of multiple trees that are trained in a way to reduce the correlation between trees (bagging with modified trees). Boosting trains sequentially, with each classifier working on the resulting residuals so far.
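The sequential-residual idea in this comment can be sketched with the simplest possible base learner, a constant fit under squared-error loss (a toy illustration, not a real gradient boosting implementation):

```python
def boost_means(y, rounds=3, lr=0.5):
    """Boosting in miniature: each round 'fits' the simplest model
    (the mean) to the current residuals and adds a shrunken step."""
    pred = [0.0] * len(y)
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]   # what's left to explain
        step = sum(resid) / len(resid)                 # constant base learner
        pred = [pi + lr * step for pi in pred]         # sequential update
    return pred
```

Each round shrinks the remaining residuals; swapping the constant for a shallow tree fit to the residuals gives the usual gradient boosted trees, whereas a random forest would average independently trained trees instead.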

        I am assuming that you meant boosted decision trees, sometimes called gradient boosted decision trees, as one usually boosts plain decision trees. I think xgboost added boosted RFs, and you can boost any supervised model, but it is not usual.

      • By hansvm 2026-03-0113:13

        The training process differs, but the resulting model only differs in data rather than code -- you evaluate a bunch of trees and add them up.

        For better or for worse (usually for better), boosted decision trees work harder to optimize the tree structure for a given problem. Random forests rely on enough trees being good enough.

        Ignoring tree split selection, one technique people sometimes do makes the two techniques more related -- in gradient boosting, once the splits are chosen it's a sparse linear algebra problem to optimize the weights/leaves (iterative if your error is not MSE). That step would unify some part of the training between the two model types.

  • By huqedato 2026-03-0114:132 reply

    Random forests on the same site: https://mlu-explain.github.io/random-forest/
