TL;DR
- Information is the inverse of probability. Rare events = High Information.
- Shannon Entropy is the average amount of information (surprise) inherent in a system.
- Cross-Entropy measures the average surprise when outcomes follow the Truth ($P$) but we score them with our Prediction ($Q$).
When we train neural networks, we are essentially trying to lower our surprise about the data until our model’s predictions align perfectly with reality.
Some time ago, during my internship, I stumbled upon this reply tweet by Yi Ma (@YiMaTweets).
Essentially entropy is a measure of volume of the space occupied by the data. It is better viewed as the number of epsilon-balls needed to pack the space...
— Yi Ma (@YiMaTweets) May 6, 2025
The formula looked so familiar but had a slightly different form. I quickly googled the formula for Cross-Entropy Loss, a concept I use constantly, from simple image classification tasks to Language Model pre-training. The structure was the same, but the formula in the tweet (Shannon Entropy) had different terms.
As a prospective graduate student, I realized I wanted to understand this intuitively. It turns out I had actually learned this formula in my first-year telecommunication engineering lectures. But back then, it was just symbols on a slide. No intuition. Just math.
This post aims to bridge that gap. I want to explain the intuition I wish I had back then, moving from simple coin flips to the loss functions we use to train neural networks.
1. What is Information?
To understand Entropy, we first have to redefine how we see information. According to Information Theory, information isn’t just data—it is a measure of surprise.
Consider the event of the sun rising in the East.
There is absolutely no surprise there. We know for a fact it happens every day.
\[P(\text{Sun rises East}) = 1\]
If you tell your mom, "The sun rose in the East today," she won't be surprised. You haven't conveyed any new information. But if you tell her, "Tomorrow the sun will rise in the West," she will be amazed. That is a highly improbable event, so it carries a massive amount of information.
Key Intuition:
- High Probability ($P \approx 1$) → Low Surprise → Low Information
- Low Probability ($P \approx 0$) → High Surprise → High Information
Therefore, Information $I(x)$ has an inverse relationship to the probability of an event:
\[I(x) \approx \frac{1}{p(x)}\]
Why the Logarithm?
If an event is incredibly rare (like $p(x) = 0.0000001$), then $1/p(x)$ becomes a massive number. To make this manageable, and to satisfy a nice mathematical property (the surprise of independent events should simply add up, and logarithms turn products of probabilities into sums), we scale this using the logarithm.
\[I(x) = \log_2 \left( \frac{1}{p(x)} \right) = -\log_2 p(x)\]
We use base 2 because in computer science, we measure information in bits.
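To see that additivity concretely: two independent fair coin flips have probability $0.5 \times 0.5 = 0.25$, and the logarithm turns that product into a sum of surprises. A minimal sketch in Python (the helper name `information_bits` is just for illustration):

```python
import math

def information_bits(p: float) -> float:
    """Surprise of an event with probability p, in bits: -log2(p)."""
    return -math.log2(p)

p_heads = 0.5
p_two_heads = p_heads * p_heads          # independent flips: probabilities multiply

print(information_bits(p_heads))         # 1.0 bit
print(information_bits(p_two_heads))     # 2.0 bits = 1.0 + 1.0, surprises add
```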
2. Counting Bits: The Coin vs. The Lottery
Let’s look at a concrete example to verify this.
The Coin Flip
Imagine I flip a fair coin. The probability of heads is 0.5. How much information or surprise do I get when I see heads?
\[I(\text{Heads}) = -\log_2(0.5) = 1 \text{ bit}\]
This makes perfect sense. To tell a computer the result of a coin flip, you need exactly 1 bit (0 or 1).
The Lottery (Biased Events)
Now, imagine a lottery. You have a 1 in a billion chance of winning, and a massive chance of losing.
- If you win: You are incredibly surprised. The math gives you roughly 30 bits of information. (Huge surprise!)
- If you lose: You aren’t surprised at all. The math gives you roughly $1.4 \times 10^{-9}$ bits, which is essentially zero. (No surprise.)
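These values are easy to check. Here is a minimal Python sketch, treating the lottery as a 1-in-a-billion event as above:

```python
import math

def information_bits(p: float) -> float:
    return -math.log2(p)

p_win = 1e-9                       # 1-in-a-billion lottery ticket
p_lose = 1 - p_win

print(information_bits(0.5))       # fair coin, heads -> 1.0 bit
print(information_bits(p_win))     # winning -> ~29.9 bits (huge surprise)
print(information_bits(p_lose))    # losing  -> ~1.4e-9 bits (essentially none)
```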
3. Shannon Entropy: The Average Surprise
We now know how to calculate the surprise for a single event. But in a system (like a coin toss or a language model), we have many possible outcomes.
Entropy (denoted as $H$) is simply the Expected Value (weighted average) of that surprise. It tells us: on average, how unpredictable is this system?
\[H(X) = \mathbb{E}[I(x)] = \sum p(x) \cdot I(x)\]
Substituting our information formula:
\[H(X) = - \sum p(x) \log p(x)\]
Let’s compare our two examples using this formula:
- The Fair Coin: The outcomes (Heads/Tails) are equally likely. It is maximum chaos. You have no idea what will happen. Entropy is High (1 bit).
- The Lottery: The outcomes are heavily biased. You are almost certain you will lose. The system is very predictable. Entropy is Low (near 0 bits).
Takeaway: Entropy measures the uncertainty of a probability distribution. A flat distribution (everything is equally likely) has the highest Entropy. A spiky distribution (one thing is certain) has the lowest Entropy.
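If you want to check these numbers yourself, here is a minimal Python sketch (the helper `entropy_bits` skips zero-probability outcomes, following the convention that $0 \log 0 = 0$):

```python
import math

def entropy_bits(probs) -> float:
    """Shannon entropy H(X) = -sum p(x) * log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]             # maximum uncertainty for two outcomes
lottery = [1e-9, 1 - 1e-9]         # win, lose

print(entropy_bits(fair_coin))     # 1.0 bit
print(entropy_bits(lottery))       # ~3e-8 bits: almost perfectly predictable
```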
4. From Entropy to Cross-Entropy
Now, let’s connect this to Deep Learning.
In Entropy, we assume we know the true probability distribution $P(x)$ of the universe. But in Machine Learning, we don’t know the truth—we are trying to model it.
- Let $P(x)$ be the True Distribution (Ground Truth).
- Let $Q(x)$ be our Predicted Distribution (what our Neural Network thinks).
Cross-Entropy asks: what is the average surprise if the outcomes are governed by reality ($P$), but we calculate the surprise using our model’s probabilities ($Q$)?
\[H(P, Q) = - \sum_{\text{all classes}} P(x) \log Q(x)\]
Why do we minimize this?
In a classification task (like ImageNet), the True Distribution $P(x)$ is usually a One-Hot vector. If the image is a Cat, the true probabilities are:
- $P(\text{Cat}) = 1$
- $P(\text{Dog}) = 0$
- $P(\text{Bird}) = 0$
Our model might predict $Q(x)$:
- $Q(\text{Cat}) = 0.7$
- $Q(\text{Dog}) = 0.2$
- $Q(\text{Bird}) = 0.1$
If we plug these into the Cross-Entropy formula:
\[H(P, Q) = - [ (1 \cdot \log 0.7) + (0 \cdot \log 0.2) + (0 \cdot \log 0.1) ]\]
Because of the zeros in $P(x)$, the terms for Dog and Bird disappear! We are left with:
\[H(P, Q) = - \log(0.7)\]This shows us why we use Cross-Entropy as a loss function. Minimizing Cross-Entropy is exactly the same as maximizing the log-probability of the correct class.
We want our model’s predicted probability $Q(\text{Cat})$ to get as close to 1 as possible. As $Q(\text{Cat}) \to 1$, the loss approaches 0.
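As a sanity check, here is a minimal sketch of this exact calculation, first by hand and then with PyTorch's built-in loss (assuming PyTorch is installed; both use the natural logarithm, so the loss comes out in nats rather than bits, but minimizing it is equivalent):

```python
import math
import torch
import torch.nn.functional as F

# True distribution P (one-hot: the image is a Cat) and model prediction Q
P = [1.0, 0.0, 0.0]                # Cat, Dog, Bird
Q = [0.7, 0.2, 0.1]

# Cross-entropy by hand: only the true class survives the sum
ce_manual = -sum(p * math.log(q) for p, q in zip(P, Q) if p > 0)
print(ce_manual)                   # -ln(0.7) ~= 0.357

# Cross-check with PyTorch. F.cross_entropy expects raw logits, and
# softmax(log(Q)) == Q, so log-probabilities work as logits here.
logits = torch.log(torch.tensor([Q]))
target = torch.tensor([0])         # index of the true class (Cat)
print(F.cross_entropy(logits, target).item())   # ~0.357, same value
```

In a real training loop the logits come straight from the network's final layer; the `torch.log` call here is only a trick to feed in the exact probabilities from the example above.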
When we train neural networks, we are essentially trying to lower our surprise about the data until our model’s predictions align perfectly with reality.