The principle of maximum entropy
Say that you are a statistician and are asked to come up with a probability distribution for the current state of knowledge on some particular topic you know little about. (This, in Bayesian statistics, is known as choosing a suitable prior.) The safest bet is to choose the least informative distribution that is consistent with what you do know, which is what the principle of maximum entropy provides.
This principle is clearly explained by Jaynes (1968):
consider a die which has been tossed a very large number of times $N$. We expect the average to be 3.5, that is, we expect a distribution where $P_n = \frac{1}{6}$ for each $n$, see the figure below.
using CairoMakie
using DataFrames

# Bar plot of the probabilities P_n for the six faces of a die.
function plot_distribution(probabilities::AbstractVector)
    fig = Figure(; size=(700, 400))
    ax = Axis(fig[1, 1]; xlabel=L"n", ylabel=L"P_n", xticks=1:6, limits=(nothing, (0, 1)), height=200)
    xlims!(ax, 0, 7)
    barplot!(ax, 1:6, probabilities; color=:gray)
    return fig
end

plot_distribution([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
Suppose instead that we are told that the average is 4.5. How likely is it for each number $n = 1, 2, \ldots, 6$ to come up on the next toss?
Since the probabilities $P_n$ always sum to 1, we have
$$\sum_{n=1}^6 P_n = 1.$$
We also know that the average is 4.5, that is,
$$\sum_{n=1}^6 n \cdot P_n = 4.5.$$
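As a minimal sketch, the two constraints can be checked numerically for any candidate distribution. The helper names `total_probability`, `mean_face`, and `satisfies_constraints` are made up for this illustration, and the tolerance is an arbitrary choice.

```julia
# Hypothetical helpers (not from the original post) for the two constraints.
total_probability(probs) = sum(probs)
mean_face(probs) = sum(n * probs[n] for n in 1:6)

# True when the probabilities sum to 1 and the mean equals `target`.
function satisfies_constraints(probs; target=4.5, atol=1e-8)
    return isapprox(total_probability(probs), 1; atol) &&
           isapprox(mean_face(probs), target; atol)
end

fair = fill(1/6, 6)
@show total_probability(fair) mean_face(fair)
@show satisfies_constraints(fair)  # false: the fair die averages 3.5, not 4.5
```

The fair die passes the normalization constraint but not the mean constraint, which is why a different distribution is needed.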
We could satisfy these constraints by choosing $P_4 = P_5 = \frac{1}{2}$.
plot_distribution([0, 0, 0, 0.5, 0.5, 0])
This is unlikely to be the distribution for our data since it can be derived in relatively few ways, namely by throwing only 4 and 5, and in such a way that the throws average to 4.5.
A more likely distribution would be
plot_distribution([0, 0, 1/4, 1/4, 1/4, 1/4])
This is still not the least informative distribution since it assumes $n = 1$ and $n = 2$ to be impossible events.
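To make "the number of ways" concrete, here is a rough sketch that counts toss sequences with the multinomial coefficient $N! / (N_1! \cdots N_6!)$. Reading "ways" as this count is an assumption of the sketch. So are the value `N = 1000` and the helper names `log_factorial` and `log_multiplicity`. Logarithms are used because the raw counts overflow quickly.

```julia
# log(k!) as a sum of logs; `init=0.0` handles k = 0 (empty range).
log_factorial(k::Integer) = sum(log, 1:k; init=0.0)

# Log of the multinomial coefficient N! / (N_1! ... N_6!), i.e. the number of
# distinct toss sequences that produce the given face counts.
function log_multiplicity(counts)
    return log_factorial(sum(counts)) - sum(log_factorial, counts)
end

N = 1000  # assumed number of tosses, chosen so that all counts are integers
two_faces = [0, 0, 0, N ÷ 2, N ÷ 2, 0]           # P_4 = P_5 = 1/2
four_faces = [0, 0, N ÷ 4, N ÷ 4, N ÷ 4, N ÷ 4]  # uniform on 3..6

@show log_multiplicity(two_faces)
@show log_multiplicity(four_faces)
```

Under these assumptions the four-face counts can be produced by far more sequences than the two-face counts, which matches the claim that the second candidate is the more likely one.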
Jaynes presents the straight-line solution $P_n = (12n - 7)/210$:
plot_distribution([(12n - 7)/210 for n in 1:6])
This solution would also fail if the mean had been higher, because the line would then push $P_1$ to zero (at a mean of $14/3 \approx 4.67$) or below, so an impossible event would occur again.
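The failure for higher means can be checked with a small sketch. It assumes the straight line has the form $P_n = a + b n$ and solves the two constraints $6a + 21b = 1$ and $21a + 91b = \text{target}$ for $a$ and $b$. The function name `straight_line` is hypothetical.

```julia
# Straight line P_n = a + b*n with sum(P_n) = 1 and sum(n * P_n) = target.
# Eliminating a from 6a + 21b = 1 and 21a + 91b = target gives
# 3.5 + 17.5b = target.
function straight_line(target)
    b = (target - 3.5) / 17.5
    a = (1 - 21b) / 6
    return [a + b * n for n in 1:6]
end

@show straight_line(4.5)  # agrees with (12n - 7)/210
@show straight_line(5.0)  # first entry is negative, so not a valid distribution
```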
The correct measure is the following information measure (Shannon, 1948), which is also known as information entropy,
$$S_I = - \sum_i p_i \log p_i.$$
We can find $p_i$ for $i = 1, 2, \ldots, 6$ by maximizing $S_I$ for given constraints.
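As an illustration, the entropy of the candidate distributions discussed so far can be computed directly. This is only a sketch: the function name `entropy` and the labels are made up here, the natural logarithm is used, and terms with $p_i = 0$ are taken to contribute zero.

```julia
# Shannon entropy S_I = -sum(p_i * log(p_i)), with the convention 0 * log(0) = 0.
entropy(probs) = -sum(p -> p > 0 ? p * log(p) : 0.0, probs)

candidates = [
    "P_4 = P_5 = 1/2" => [0, 0, 0, 0.5, 0.5, 0],
    "uniform on 3..6" => [0, 0, 1/4, 1/4, 1/4, 1/4],
    "straight line" => [(12n - 7)/210 for n in 1:6],
]

for (name, probs) in candidates
    println(rpad(name, 18), round(entropy(probs); digits=4))
end
```

The first two candidates evaluate to $\log 2$ and $\log 4$, and the straight line scores higher than both, in line with the order in which the text introduces them.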
This problem, known as MaxEnt, is hard to solve manually since there are six unknowns and several constraints.
The solution can be approximated by rewriting the problem as a linear program.
Alternatively, analytic solutions exist for some subsets of this Shannon entropy maximization problem (Zabarankin and Uryasev, 2014).
Here, only the mean is known, so the number of known moments $m$ is 1.
Then, the maximum entropy distribution takes the form (Zabarankin and Uryasev, 2014; Eq. 5.1.7),
$$P_n = \frac{e^{\rho n}}{\sum_{k=1}^6 e^{\rho k}}, \quad \text{for } n = 1, 2, \ldots, 6.$$
This function satisfies $\sum_{n=1}^6 P_n = 1$ for any $\rho$.
Now, we only have to find the $\rho$ for which the average is 4.5.
After some trial and error, you'll find that $\rho = 0.3715$ gives $\sum_{n=1}^6 n \cdot P_n \approx 4.501$.
plot_distribution([0.0543, 0.0787, 0.114, 0.165, 0.240, 0.348])
This is the least informative distribution that satisfies the constraints. In other words, it is the distribution that can be obtained in the largest number of ways, given the constraints. For another example of maximum entropy distributions, see Section 10.1 of the book by McElreath (2020).
Trial and error
# Probability of face k under the maximum entropy form with parameter rho.
p(k, rho) = exp(rho * k) / sum(exp(rho * j) for j in 1:6)

# Show the probabilities, their sum, and the resulting average for a given rho.
function ps(rho)
    values = map(k -> p(k, rho), 1:6)
    @show values
    sum_values = sum(values)
    @show sum_values
    average = sum(k * values[k] for k in 1:6)
    @show average
    Base.Text("""
        values = $values
        sum_values: $sum_values
        average = $average
        """)
end
ps(0.4)
values = [0.04906874617024226, 0.0732019674190579, 0.1092045029116822, 0.16291397453728548, 0.24303909080562353, 0.36257171815610867]
sum_values: 1.0
average = 4.565367850857316
ps(0.34)
values = [0.060524771142131895, 0.08503413138555115, 0.11946849800579813, 0.16784697842149762, 0.23581620791666263, 0.3313094131283586]
sum_values: 1.0
average = 4.427323959970083
ps(0.3715)
values = [0.05426741458481562, 0.07868275019416264, 0.11408273685935422, 0.165409455277101, 0.2398284670256302, 0.3477291760589363]
sum_values: 1.0
average = 4.501036338141376
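The manual search above could also be automated. The sketch below uses plain bisection, which is a swapped-in technique and not the method used in this post. It relies on the average increasing with $\rho$. The names `maxent_probs`, `mean_for_rho`, and `find_rho`, the bracket `[-5, 5]`, and the tolerance are all assumptions made for the illustration.

```julia
# Maximum entropy form from the text: P_n proportional to exp(rho * n).
function maxent_probs(rho)
    weights = [exp(rho * n) for n in 1:6]
    return weights ./ sum(weights)
end

# Average face value under maxent_probs(rho).
mean_for_rho(rho) = sum(n * p for (n, p) in enumerate(maxent_probs(rho)))

# Bisection on rho. The bracket [lo, hi] must contain the solution; the
# defaults cover averages from roughly 1 to 6 (assumed, not from the post).
function find_rho(target; lo=-5.0, hi=5.0, tol=1e-10)
    while hi - lo > tol
        mid = (lo + hi) / 2
        if mean_for_rho(mid) < target
            lo = mid
        else
            hi = mid
        end
    end
    return (lo + hi) / 2
end

rho = find_rho(4.5)
@show rho mean_for_rho(rho)
```

The result lies between the two trial values 0.34 and 0.3715 used above, which is consistent with their averages falling on either side of 4.5.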
Built with Julia 1.11.5 and CairoMakie 0.12.16, DataFrames 1.7.0