import pandas as pd
import numpy as np
import scipy.stats as stats

from ipywidgets import *
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.dpi'] = 70

Polynomial Curve Fitting

Fitting the data is a typical approach used in many machine learning techniques. However, a method that focuses purely on fitting the data well does not necessarily yield the best model. This phenomenon is known as over-fitting.

Suppose we have observed 13 data points (called training data because we use these data to train our forecasting model) that are generated from the function $\sin(2 \pi x)$ with some small perturbations. The figure below shows these training data as blue circles together with the curve of $\sin(2 \pi x)$.

def sinusoidal(x):
    return np.sin(2 * np.pi * x)

def create_data(func, sample_size, std, domain=[0, 1]):
    x = np.linspace(*domain, sample_size)
    np.random.shuffle(x)
    t = func(x) + np.random.normal(scale=std, size=x.shape)
    return x, t

np.random.seed(11223)
x_train, t_train = create_data(sinusoidal, 13, 0.25)

x_test = np.linspace(0, 1, 100)
t_test = sinusoidal(x_test)
plt.scatter(x_train, t_train, facecolor="none", edgecolor="b", s=50, label="training data")
plt.plot(x_test, t_test, c="g", label=r"$\sin(2\pi x)$")
plt.ylim(-1.5, 1.5)
plt.legend(loc=1)
plt.show()

Our task is to predict the underlying data pattern by using simple polynomial functions with degree 1, 3 and 9. In other words, without knowing the true function $\sin(2 \pi x)$, we need to

  1. fit the given 13 training data points with simple polynomial functions,
  2. choose one of the polynomial functions to better represent the true function $\sin(2 \pi x)$ for forecasting future observations.

If you feel that the numerical example is not easy to comprehend, you may consider the following application.

The green $\sin(2 \pi x)$ curve is the true underlying business cycle (which has an up-and-down pattern), and the blue circles are observed sales. In an ideal world, sales are fully described by the business cycle. However, in the real world sales may be perturbed from the true business cycle by other random factors (such as weather). That explains why those blue circles are generated from the $\sin(2 \pi x)$ curve with small perturbations. Our goal is to offer an understanding (i.e., a polynomial function) that properly explains the underlying business cycle.

The figure below shows the fitted polynomials with degree 1, 3 and 9, drawn as red curves (or a line) in the subplots.

np.random.seed(11223)
x_train, t_train = create_data(sinusoidal, 13, 0.25)

x_test = np.linspace(0, 1, 100)
t_test = sinusoidal(x_test)

fig = plt.figure(figsize=(15, 4))
for i, degree in enumerate([1, 3, 9]):
    plt.subplot(1, 3, i+1)
    poly = PolynomialFeatures(degree=degree, include_bias=True)
    model = LinearRegression()
    model.fit(poly.fit_transform(x_train[:,None]),t_train[:,None])
    t = model.predict(poly.fit_transform(x_test[:,None]))
    plt.scatter(x_train, t_train, facecolor="none", edgecolor="b", s=50, label="training data")
    plt.plot(x_test, t_test, c="g", label=r"$\sin(2\pi x)$")
    plt.plot(x_test, t, c="r", label="fitting")
    plt.ylim(-1.5, 1.5)
    # plt.legend(loc=1)
    plt.title("polynomial fitting with dregree {}".format(degree))
    plt.annotate("degree={}".format(degree), xy=(0.5, 1.2))
plt.legend(bbox_to_anchor=(1.05, 0.34), loc=2, borderaxespad=0.5)
plt.show()

Note that a simple polynomial function with degree 1 is a linear function. Hence, the fitting procedure with a linear function is the well-known linear regression.

If we knew the underlying true function $\sin(2 \pi x)$, then it is quite clear that

  • the linear function is under-fitting because it doesn't explain the up-and-down pattern in the training data;
  • the polynomial function with degree 9 is over-fitting. Although the polynomial curve accurately describes the observed training data, it focuses too much on the noise and does not correctly capture the pattern of the underlying function $\sin(2 \pi x)$;
  • the polynomial with degree 3 is the best because it balances information and noise, extracting the valuable information from the training data.

However, as in many real world applications, we do not know the underlying true pattern of the training data, i.e., the function $\sin(2 \pi x)$ in our example. We can only observe a figure as below. Then, the figure poses a simple question to all decision makers: which polynomial best represents the data pattern for future forecasting?

np.random.seed(11223)
x_train, t_train = create_data(sinusoidal, 13, 0.25)

x_test = np.linspace(0, 1, 100)
t_test = sinusoidal(x_test)

fig = plt.figure(figsize=(15, 4))
for i, degree in enumerate([1, 3, 9]):
    plt.subplot(1, 3, i+1)
    poly = PolynomialFeatures(degree=degree, include_bias=True)
    model = LinearRegression()
    model.fit(poly.fit_transform(x_train[:,None]),t_train[:,None])
    t = model.predict(poly.fit_transform(x_test[:,None]))
    plt.scatter(x_train, t_train, facecolor="none", edgecolor="b", s=50, label="training data")
    plt.plot(x_test, t, c="r", label="fitting")
    plt.ylim(-1.5, 1.5)
    plt.title("polynomial fitting with dregree {}".format(degree))
    plt.annotate("degree={}".format(degree), xy=(0.5, 1.2))
plt.legend(bbox_to_anchor=(1.05, 0.34), loc=2, borderaxespad=0.5)
plt.show()

If we only consider the error measure, we may choose the polynomial function with degree 9 as our model candidate for future forecasting, because this polynomial function describes the training data with the least error (it passes through almost every data point exactly). The issue is that the polynomial function with degree 9 captures both the information (the values coming from the true function $\sin(2 \pi x)$) and the noise (the perturbation on each data point). We may wonder whether it is possible to identify the degree-3 polynomial as the best function, balancing information and noise.

The answer is positive. We can use a holdout (or testing) dataset as follows:

  • choose 3 (or any small number of) data points as testing data
  • fit polynomial functions only on the remaining 10 data points
  • find the best fitting function based on those 3 chosen data points (i.e., the testing data)

The figure below illustrates the results, where the 3 orange circles are the testing data points.

np.random.seed(11223)
x_train, t_train = create_data(sinusoidal, 13, 0.25)

x_test = np.linspace(0, 1, 100)
t_test = sinusoidal(x_test)

fig = plt.figure(figsize=(15, 4))
for i, degree in enumerate([1, 3, 9]):
    plt.subplot(1, 3, i+1)
    poly = PolynomialFeatures(degree=degree, include_bias=True)
    model = LinearRegression()
    model.fit(poly.fit_transform(x_train[:-3,None]),t_train[:-3,None])
    t = model.predict(poly.fit_transform(x_test[:,None]))
    plt.scatter(x_train[:-3], t_train[:-3], facecolor="none", edgecolor="b", s=50, label="training data")
    plt.scatter(x_train[-3:], t_train[-3:], facecolor="none", edgecolor="orange", s=50, label="testing data")
    # plt.plot(x_test, t_test, c="g", label="$\sin(2\pi x)$")
    plt.plot(x_test, t, c="r", label="fitting")
    plt.ylim(-1.5, 1.5)
    plt.title("polynomial fitting with dregree {}".format(degree))
    plt.annotate("degree={}".format(degree), xy=(0.5, 1.2))
plt.legend(bbox_to_anchor=(1.05, 0.34), loc=2, borderaxespad=0.5)
plt.show()

Apparently, we can rule out linear regression since it does a poor job of explaining the training data. Although the 9-degree polynomial fits the training data almost perfectly, it performs horribly on the testing data: the $y$ values corresponding to the 3 testing data points actually fall outside the chart for the 9-degree polynomial. Hence, the 3-degree polynomial is the obvious winner based on the testing data.

The example illustrates the law of parsimony (Occam's razor): we generally prefer a simpler model because simpler models track only the most basic underlying patterns and can be more flexible and accurate in predicting the future!

We can obtain some quantitative insight into the predictability of different polynomial functions by considering the root-mean-square error (RMSE). As shown in the next graph, the RMSE of the training data decreases while the RMSE of the testing data increases if the complexity of the model (i.e., degree of polynomial functions) increases.

Note that RMSE is defined as \begin{align} E_{RMSE} = \sqrt{\frac{\text{Sum of squared errors between fitted $y$ and true $y$}}{\text{Num of Obs.}}} \nonumber \end{align}

in which the division by Num of Obs. allows us to compare different sizes of data sets on an equal footing.

np.random.seed(11223)
x_train, t_train = create_data(sinusoidal, 13, 0.25)

x_test = np.linspace(0, 1, 100)
t_test = sinusoidal(x_test)

polymodels = [np.poly1d(np.polyfit(x_train, t_train, degree)) for degree in range(13)]

# RMSE calculation for training and testing errors
training_errors = [np.sqrt(np.mean((t_train - polymodel(x_train)) ** 2)) for polymodel in polymodels]
test_errors = [np.sqrt(np.mean((t_test + np.random.normal(scale=0.25, size=len(x_test)) - polymodel(x_test)) ** 2))
               for polymodel in polymodels]

plt.figure(figsize=(6, 5))
plt.plot(training_errors, 'o-', mfc="none", mec="b", ms=10, c="b", label="training")
plt.plot(test_errors, 'o-', mfc="none", mec="r", ms=10, c="r", label="test")
plt.legend()
plt.xlabel("Degree")
plt.ylabel("RMSE")
plt.show()

Probability Distributions

Probability theory plays a central role in the solution of learning problems.

Bernoulli Distribution

\begin{align*} \text{Bern}(k|\theta) = \left\{ \begin{aligned} & \theta && \text{if } k = 1 \\ & 1-\theta && \text{if } k = 0 \end{aligned} \right. \end{align*}
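As a quick sanity check, we can evaluate this probability mass function with scipy.stats (a minimal sketch; the value $\theta = 0.3$ is arbitrary):

theta = 0.3
print(stats.bernoulli.pmf(1, theta))  # equals theta = 0.3
print(stats.bernoulli.pmf(0, theta))  # equals 1 - theta = 0.7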

Binomial Distribution

\begin{align*} \text{Binom}(k|n,\theta) = \left( \begin{array}{c} n \\ k \end{array} \right) \theta^k (1-\theta)^{n-k} = \frac{n!}{k!(n-k)!} \theta^k (1-\theta)^{n-k} \quad \forall k=0, \ldots, n \end{align*}

We note $\text{Bern}(k|\theta) = \text{Binom}(k|1,\theta)$. The plot shows the probability mass function (estimated by a normalized histogram of samples) of the binomial distribution with $n=10$ and $\theta=0.25$.

data = stats.binom.rvs(10, 0.25, size=10000)
plt.hist(data, bins=data.max()+1, density=True)
plt.show()

Beta Distribution

\begin{align*} \text{Beta}(\theta|a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \theta^{a-1} (1-\theta)^{b-1} \end{align*}

and

\begin{align*} E(\theta) = \frac{a}{a+b} \end{align*}

The plots show the probability density function of Beta distribution with different $a$ and $b$ values.

x = np.linspace(0, 1, 100)
plt.figure(figsize=(6, 5))
for i, [a, b] in enumerate([[0.1, 0.1], [1, 1], [2, 3], [8, 4]]):
    plt.subplot(2, 2, i + 1)
    beta = stats.beta(a, b)
    plt.xlim(0, 1)
    plt.ylim(0, 3)
    plt.plot(x, beta.pdf(x))
    plt.annotate("a={}".format(a), (0.1, 2.5))
    plt.annotate("b={}".format(b), (0.1, 2.1))
plt.show()

Gaussian Distribution

For an $n$-dimensional vector $\mathbf{x}$, the multivariate Gaussian distribution takes the form

\begin{align*} \mathcal{N}(\mathbf{x}|\mu, \Sigma) = \frac{1}{(2 \pi)^{n/2}} \frac{1}{|\Sigma|^{1/2}} \exp \left( -\frac{1}{2} (\mathbf{x} - \mu)^{\intercal} \Sigma^{-1} (\mathbf{x} - \mu) \right) \end{align*}

When $n=1$, we have the form

\begin{align*} \mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2 \pi} \sigma} \exp \left( -\frac{1}{2 \sigma^2} (x - \mu)^2 \right) \end{align*}

The plot shows the probability density function of the Gaussian (normal) distribution with mean $0$ and variance $0.3$.

x = np.linspace(-1.2, 1.2, 100)
plt.figure(figsize=(5, 4))
plt.plot(x, stats.norm.pdf(x, 0, np.sqrt(0.3)), label=r"$\mathcal{N}(0, 0.3)$")
plt.show()

Next, we demonstrate the central limit theorem with histogram plots of the mean of $n$ uniformly distributed numbers

\begin{align*} \frac{x_1 + \ldots + x_n}{n} \end{align*}

for various values of $n$. We observe that as $n$ increases, the distribution tends towards a Gaussian.

plt.figure(figsize=(12, 4))

for i, num in enumerate([1, 2, 10]):
    plt.subplot(1, 3, i+1)
    plt.xlim(0, 1)
    plt.ylim(0, 5)
    plt.annotate("n={}".format(num), (0.1, 4.5))
    sample = 0
    for _ in range(num):
        sample = sample + np.random.uniform(0, 1, 10000)
        # np.random.uniform(0, 1, 10000) indicates random 10000 draws of x_{num}
    plt.hist(sample / num, bins=50, density=True)
plt.show()

Gamma Distribution
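
The gamma distribution with shape $a$ and rate $b$ has density

\begin{align*} \text{Gam}(x|a,b) = \frac{1}{\Gamma(a)} b^a x^{a-1} e^{-bx}, \quad x > 0 \end{align*}

with mean $E(x) = a/b$. The plots below show the probability density function for different $a$ and $b$ values; note that scipy parameterizes the distribution by the scale $1/b$.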

plt.figure(figsize=(6, 5))

x = np.linspace(0, 2, 100)
for i, [a, b] in enumerate([[0.1, 0.1], [1, 1], [2, 3], [4, 6]]):
    plt.subplot(2, 2, i + 1)
    plt.xlim(0, 2)
    plt.ylim(0, 2)
    plt.plot(x, stats.gamma.pdf(x, a, loc=0, scale=1/b))
    plt.annotate("a={}".format(a), (1, 1.6))
    plt.annotate("b={}".format(b), (1, 1.3))
plt.show()

Student's t Distribution

The plots show the probability density function of Student's t distribution.

plt.figure(figsize=(15, 5))
x = np.linspace(-8, 8, 400)
for i, t in enumerate([0.1, 1, 100]):
    plt.subplot(1, 3, i + 1)
    plt.xlim(-8, 8)
    plt.ylim(0, 0.5)
    plt.plot(x, stats.t.pdf(x, t))
    plt.annotate("$v$={}".format(t), (-.8, 0.45))
plt.show()

The distribution is used in statistics to estimate the mean of a sample when the variance is unknown. We fit the data with normal and Student's t distributions. As we can see, the Student's t distribution is more robust because it is less sensitive to outliers.

plt.figure(figsize=(5, 4))
X = np.random.normal(size=20)
X = np.concatenate([X, np.random.normal(loc=20., size=3)])
plt.hist(X.ravel(), bins=50, density=1., label="Samples")

x = np.linspace(-12, 25, 1000)
plt.plot(x, stats.t.pdf(x, *stats.t.fit(X)), label="Student's t", linewidth=2)
plt.plot(x, stats.norm.pdf(x, *stats.norm.fit(X)), label="Gaussian", linewidth=2)
plt.legend()
plt.show()

Bayesian Framework

The Frequentist View

Let $x$ be a random variable and we are trying to estimate its mean $\mu$. The frequentist believes the true mean $\mu$ is a fixed value.

  • An unbiased estimate after $n$ observations $x_1, \ldots, x_n$ is \begin{align*} \mu_n = \frac{1}{n} \sum_{i=1}^{n} x_i \end{align*}
  • An unbiased estimate of the variance of $x$ is \begin{align*} \hat{\sigma}^2_n = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \mu_n \right)^2 \end{align*}

The estimates of the mean and variance of $x$ can be written recursively as follows

\begin{align*} \mu_n &= \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{n-1}{n} \frac{1}{n-1} \sum_{i=1}^{n-1} x_i + \frac{1}{n} x_n = \left(1-\frac{1}{n}\right) \mu_{n-1} + \frac{1}{n} x_n \\ \hat{\sigma}^2_n &= \left\{\begin{aligned} &\frac{1}{n} \left( x_n - \mu_{n-1} \right)^2 && n=2 \\ &\frac{n-2}{n-1} \hat{\sigma}^2_{n-1} + \frac{1}{n} \left( x_n - \mu_{n-1} \right)^2 && n>2 \end{aligned} \right. \end{align*}

The estimator $\mu_n$ is a random variable because it is computed from other random variables, namely random samples $x_1, \ldots, x_n$. The estimate of the variance of the estimator $\mu_n$ is \begin{align*} \bar{\sigma}^2_n = \frac{1}{n} \hat{\sigma}^2_n \end{align*}
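
The following sketch verifies the recursive formulas against the batch estimates np.mean and np.var (a minimal check using i.i.d. normal samples with arbitrary parameters):

# verify the recursive mean/variance updates against the batch formulas
np.random.seed(0)
xs = np.random.normal(loc=2.0, scale=1.5, size=1000)

mu, var = xs[0], 0.0
for n in range(2, len(xs) + 1):
    x_n = xs[n - 1]
    # update the variance first: it uses the previous mean mu_{n-1}
    if n == 2:
        var = (x_n - mu) ** 2 / n
    else:
        var = (n - 2) / (n - 1) * var + (x_n - mu) ** 2 / n
    mu = (1 - 1 / n) * mu + x_n / n

print(mu, np.mean(xs))          # recursive vs. batch mean
print(var, np.var(xs, ddof=1))  # recursive vs. batch (unbiased) variance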

Theorem

We have

  • $\mu_n$ is an unbiased estimate of $\mu$

  • $\hat{\sigma}^2_n$ is an unbiased estimate of $\text{Var}(x)$

  • $\bar{\sigma}^2_n$ is an unbiased estimate of $\text{Var}(\mu_n)$

Proof
\begin{align*} E(\mu_n) &= \frac{1}{n} \sum_{i=1}^{n} E(x_i) = \mu \\ \end{align*}

We denote $\text{Var}(x)$ as $\sigma^2$. \begin{align*} E(\hat{\sigma}^2_{n}) &= \frac{1}{n-1} E \left( \sum_{i=1}^{n} (x_i - \mu_{n})^2\right) \\ &= \frac{1}{n-1} E \left( \sum_{i=1}^{n} x^2_i - 2\sum_{i=1}^{n} x_i \mu_{n} + \sum_{i=1}^{n} \mu^2_{n} \right) \\ &= \frac{1}{n-1} \left( \sum_{i=1}^{n} E \left( x^2_i \right) - 2 E \left( \mu_{n} \sum_{i=1}^{n} x_i \right) + E\left(\sum_{i=1}^{n} \mu^2_{n} \right) \right) \\ &= \frac{1}{n-1} \left( \sum_{i=1}^{n} E \left( x^2_i \right) - 2 n E \left( \mu_{n} \frac{1}{n}\sum_{i=1}^{n} x_i \right) + n E\left(\mu^2_{n} \right) \right) \\ &= \frac{1}{n-1} \left( \sum_{i=1}^{n} E \left( x^2_i \right) - 2 n E \left( \mu^2_{n} \right) + n E\left(\mu^2_{n} \right) \right) \\ &= \frac{1}{n-1} \left( \sum_{i=1}^{n} E \left( x^2_i \right) - n E \left( \mu^2_{n} \right) \right) \\ &= \frac{1}{n-1} \left( \sum_{i=1}^{n} (\sigma^2 + \mu^2) - n \left( \text{Var} \left( \mu_{n} \right) + E^2(\mu_{n}) \right) \right) && \text{because $E(\mu_n) = \mu$} \\ &= \frac{n}{n-1} \left(\sigma^2 - \text{Var} \left( \mu_{n} \right) \right) \\ &= \frac{n}{n-1} \left(\sigma^2 - \text{Var} \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)\right) \\ &= \frac{n}{n-1} \left(\sigma^2 - \frac{1}{n^2} \sum_{i=1}^{n} \text{Var} \left( x_i \right) \right) && \text{because of i.i.d. samples} \\ &= \frac{n}{n-1} \left(\sigma^2 - \frac{1}{n^2} \sum_{i=1}^{n} \sigma^2 \right) = \sigma^2 \end{align*}

Next, we show that $\text{Var}(\mu_n) = \sigma^2 / n$: \begin{align*} \text{Var}(\mu_n) &= \text{Var}\left(\frac{1}{n} \sum_{i=1}^n x_i\right) \\ & = \frac{1}{n^2} \text{Var}\left(\sum_{i=1}^n x_i\right) \\ & = \frac{1}{n^2} n \text{Var}\left(x\right) && \text{because of i.i.d. samples} \\ & = \frac{1}{n} \sigma^2 \end{align*}

Since $E(\hat{\sigma}^2_n) = \sigma^2$, we have $E(\bar{\sigma}^2_n) = \frac{1}{n} E(\hat{\sigma}^2_n) = \frac{\sigma^2}{n} = \text{Var}(\mu_n)$.


The Bayesian View

Again, we try to estimate the mean $\mu$ of a random variable $x$. The Bayesian perspective considers $\mu$ as a random variable. In the Bayesian view, any unknown number is interpreted as a random variable, and the distribution of this random variable represents our belief about how likely it is to take on certain values.

Bayes' Theorem states \begin{align*} p(\mu|\mathcal{D}) = \frac{p(\mathcal{D}|\mu) p(\mu)}{p(\mathcal{D})} \end{align*}

or $\text{posterior} \propto \text{likelihood} \times \text{prior}$, where $\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ is the observed dataset, and

  • $p(\mu)$ is a prior probability distribution which captures our assumptions about $\mu$ before observing the data;

  • $p(\mathcal{D}|\mu)$ is the likelihood function and expresses how probable the observed data set is for different settings of the parameter $\mu$ through the conditional probability;

  • $p(\mathcal{D})$ acts as a normalization constant because it does not depend on $\mu$. It ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to one.

Conjugacy: Bayes rule can be simply expressed as

\begin{align*} \text{posterior} \propto \text{likelihood} \times \text{prior} \end{align*}

We prefer to choose a prior such that the posterior distribution has the same functional form as the prior. This property is called conjugacy, and such a prior is called a conjugate prior. We will see several examples later. If the posterior distribution has the same functional form as the prior, then Bayes' rule can easily be applied recursively.
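
The following sketch illustrates $\text{posterior} \propto \text{likelihood} \times \text{prior}$ numerically on a discretized grid of $\theta$ values; the Beta prior and Bernoulli likelihood are illustrative choices (the same model is treated analytically in the Beta-Bernoulli section later):

# posterior is proportional to likelihood times prior, evaluated on a grid
theta = np.linspace(0.001, 0.999, 999)
prior = stats.beta.pdf(theta, 2, 2)         # illustrative Beta(2,2) prior
likelihood = theta ** 3 * (1 - theta) ** 1  # likelihood of 3 heads and 1 tail
posterior = likelihood * prior
posterior /= posterior.sum() * (theta[1] - theta[0])  # normalize numerically

plt.plot(theta, prior, label="prior Beta(2,2)")
plt.plot(theta, posterior, label="grid posterior")
plt.plot(theta, stats.beta.pdf(theta, 5, 3), "--", label="Beta(5,3) by conjugacy")
plt.legend()
plt.show()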

Copernicus Method

The main character of our story is J. Richard Gott III, who took a trip to Europe in 1969 and saw the stark symbol of the Cold War — the Berlin Wall, which had been built eight years earlier, in 1961. A question formed in his mind: how much longer would it continue to divide the East and the West?

Which movie is this picture from?

Answer

Bridge of Spies. The picture shows James B. Donovan, played by Tom Hanks, walking beside the Berlin Wall.


This may sound like an absurd prediction. Even setting aside the impossibility of predicting geopolitics, the question seems mathematically impossible: he needs to make a prediction from a single data point. But as ridiculous as this might seem on its face, we make such predictions all the time by necessity.

Gott applied his "Copernican method" to the lifetime of the Berlin Wall. The method was published in Nature. Why is it called the Copernican method?

  • In about 1510, the astronomer Nicolaus Copernicus was obsessed by the questions: Where are we? Where in the universe is the Earth? Copernicus made a radical conjecture for his time: the Earth was not the center of the universe but, in fact, nowhere special in particular.

  • When Gott arrived at the Berlin Wall, he asked himself the same question: Where am I? That is to say, where in the total life span of this artifact have I happened to arrive? In a way, he was asking the temporal version of the spatial question that had been proposed by Copernicus four hundred years earlier. Gott decided to take the same step with regard to time.

How does Copernican method work?
Answer:
  • He made the assumption that the moment when he encountered the Berlin Wall was not special: his visit should be located at some random point between the Wall's beginning and its end.

  • We can divide that timeline up into quarters, like so.

  • Gott reasoned that his visit, because it was not special in any way, could be located anywhere on that timeline. From the standpoint of probability, that meant there was a 50 percent chance that it was somewhere in the middle portion of the wall's timeline — the middle two quarters, or 50 percent, of its existence.

  • When Gott visited in 1969, the Wall had been standing for eight years. If his visit took place at the very beginning of that middle portion of the Wall’s existence, Gott reasoned, that eight years would represent exactly one quarter of the way into its history. That would mean the Wall would exist for another 24 years, coming down in 1993.

  • If, on the other hand, his visit happened at the very end of the middle portion, then the eight years would represent exactly three quarters of the way into the Wall’s history. Each quarter would represent just 2.66 years, meaning that the wall could fall as early as 1971.

  • So by that logic, there was a 50 percent chance that the Wall would come down between 1971 (2.66, or 8/3, years into the future) and 1993 (24, or 8 × 3, years into the future). In reality, the Wall fell in 1989, well within his predicted range.


The Copernican method used by Gott gives a 50 percent chance range for when the Wall would come down. Suppose we were in front of the Berlin Wall in 1969; given only the information that the Berlin Wall had existed for 8 years, can we predict a single number for its lifetime? The answer is positive.

How to predict a single number for its lifetime?
Answer

Recall that Gott made the assumption that the moment when he encountered the Berlin Wall wasn't special, that it was equally likely to be any moment in the wall's total lifetime. And if any moment was equally likely, then on average his arrival should have come precisely at the halfway point (why? because the visit was 50% likely to fall before the halfway point and 50% likely to fall after it).

More generally, unless we know better we can expect to have shown up precisely halfway into the duration of any given phenomenon. And if we assume that we're arriving precisely halfway into the duration, the best guess we can make for how long it will last into the future becomes obvious: exactly as long as it's lasted already. Since Gott saw the Berlin Wall eight years after it was built, our best guess was that it would stand for eight years more. We may predict that the Berlin Wall would fall in 1977. Although the prediction is not precise (a prediction will never be precise), it is the best estimate given that the only information is that the Berlin Wall has existed for 8 years.


After knowing how to make a prediction based on a single data point, you can try to answer the following two questions:

Two questions?
  • You arrive at a bus stop in a foreign city and learn, perhaps, that the other tourist standing there has been waiting 7 minutes. When is the next bus likely to arrive? Is it worthwhile to wait — and if so, how long should you do so before giving up?

  • A friend of your friend's has been dating somebody for one month and wants your advice: is it too soon to invite him/her along to an upcoming family wedding? The relationship is off to a good start, but how far ahead is it safe to make plans?

Answers

The answers are 7 minutes and 1 month.


Prior and Observation

Consider a game of flipping coins. There are two different coins. One is a fair coin with a 50–50 chance of heads and tails (as shown on the left of the next figure); the other is a two-headed coin (as shown on the right of the next figure):

I drop them into a bag and then pull one out at random. Then I flip it once: heads. Although you do not know which coin I pulled out, you have observed the outcome of the flip: heads. Based on your best guess,

Which coin do you think I flipped?
Answer

There is a 100% chance of getting heads with the two-headed coin, but only a 50% chance with the fair coin. Thus we can assert confidently that it is exactly twice as probable that I pulled out the two-headed coin. Therefore it is in your best interest to guess the two-headed coin.


Now I show you 9 fair coins and 1 two-headed coin as shown in the figure:

I put all ten into a bag, draw one at random, and flip it: heads. Now what do you suppose?

Is it a fair coin or the two-headed one?
Answer

You should guess the fair coin. As before, a fair coin is exactly half as likely to come up heads as a two-headed coin. But now, a fair coin is also 9 times as likely to have been drawn in the first place. It is exactly 4.5 times more likely that I am holding a fair coin than the two-headed one.
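
The same conclusion follows directly from Bayes' theorem; a quick numerical check:

# Bayes rule for the bag with 9 fair coins and 1 two-headed coin
p_fair, p_two = 9 / 10, 1 / 10      # prior: the composition of the bag
lik_fair, lik_two = 0.5, 1.0        # probability of heads under each coin
evidence = lik_fair * p_fair + lik_two * p_two
print(lik_fair * p_fair / evidence)             # P(fair | heads) = 0.818...
print(lik_two * p_two / evidence)               # P(two-headed | heads) = 0.181...
print((lik_fair * p_fair) / (lik_two * p_two))  # posterior odds = 4.5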


In summary, the number of coins and their probabilities of showing heads are our prior. The outcome of a flip is our observation. We have different predictions when our priors are different. The game of flipping coins perfectly illustrates Bayes' rule: \begin{align} \text{prediction} \approx \text{observations} + \text{prior knowledge}. \nonumber \end{align}

Revisit the Copernican method: after Gott published his conjecture in Nature, the journal received a flurry of critical correspondence. And it's easy to see why when we try to apply the rule to some more familiar examples. If you meet a 90-year-old man, the Copernican Principle predicts he will live to 180. Every 6-year-old boy, meanwhile, is predicted to face an early death at the age of 12. The Copernican method does not make sense when we have prior information that people rarely live past 100, and that 6-year-old kids normally live to adulthood.

The Copernican method makes predictions based solely on the observations, because it is exactly Bayes' rule using an uninformative prior (in other words, we have no prior information or we do not know how to incorporate the prior information).

After we have learned how to incorporate prior information, the core question of making a prediction becomes: what prior should we use?

Simple Prediction Rules

We introduce three simple prediction rules as follows:

Average Rule

If our prior information follows a normal distribution (things that tend toward or cluster around some kind of natural value), we use the distribution's "natural" average — its single, specific scale — as our guide. For example, the average life span for men in the US is 76 years. So, for a 6-year-old boy, the prediction is 77. He gets a tiny edge over the population average of 76 by virtue of having made it through infancy, because we know he is not in the distribution's left tail.

Multiplicative Rule

If our prior information shows that rare events have a tremendous impact (a power-law distribution), then multiply the quantity observed so far by some constant factor.

For example, movie box-office grosses do not cluster around a "natural" center. Instead, most movies don't make much money at all, but the occasional Titanic makes titanic amounts. Income (or money in general) also follows a power law (the rich get richer).

For the grosses of movies, the multiplier happens to be 1.4. So if you hear a movie has made $6 million so far, you can guess it will make about $8.4 million overall. This multiplicative rule is a direct consequence of the fact that power-law distributions do not specify (or we are uninformed about) a natural scale for the phenomenon they are describing. It is possible that a movie that has grossed $6 million is actually a blockbuster in its first hour of release, but it is far more likely to be just a single-digit-millions kind of movie.

In short, something normally distributed that's gone on seemingly too long is bound to end shortly; but the longer something in a power-law distribution has gone on, the longer you can expect it to keep going.

Additive Rule

There's a third category of things in life: those that are neither more nor less likely to end just because they have gone on for a while. If the prior follows an Erlang distribution (e.g., exponential or gamma distributions), then we should always predict that things will go on just a constant amount longer. It is a memoryless prediction. The additive rule is suitable for events that are completely independent from one another, so the intervals between them fall on an Erlang distribution.
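
A minimal sketch of the three rules as functions (the life-span average of 76 and the movie multiplier of 1.4 come from the discussion above; the additive constant is whatever fits the phenomenon):

# sketches of the three prediction rules; the constants are illustrative
def average_rule(observed_so_far, natural_average=76):
    # normal prior: predict the distribution's natural scale
    return max(natural_average, observed_so_far)

def multiplicative_rule(observed_so_far, factor=1.4):
    # power-law prior: multiply the quantity observed so far
    return observed_so_far * factor

def additive_rule(observed_so_far, constant):
    # memoryless (Erlang) prior: always predict a constant amount more
    return observed_so_far + constant

print(average_rule(6))               # a 6-year-old boy -> about the average of 76
print(multiplicative_rule(6e6))      # a $6 million gross -> about $8.4 million
print(additive_rule(7, constant=7))  # e.g. 7 minutes waited -> about 7 more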


With all these rules in hand, we consider a concrete example: if you are waiting for a win at the roulette wheel,

What is your strategy by applying each of the above rules separately?

Answer

Suppose a win at the roulette wheel were characterized by a normal distribution; then the Average Rule would apply: after a run of bad luck, it would tell you that your number should be coming any second, probably followed by more losing spins. The strategy is to quit after winning.

If, instead, the wait for a win obeyed a power-law distribution, then the Multiplicative Rule would tell you that winning spins follow quickly after one another, but the longer a drought has gone on, the longer it will probably continue. The strategy is to keep playing for a while after any win, but give up after a losing streak.

If it is a memoryless distribution, then you're stuck. The Additive Rule tells you the chance of a win now is the same as it was an hour ago, and the same as it will be an hour from now. Nothing ever changes. You're not rewarded for sticking it out and ending on a high note; neither is there a tipping point when you should just cut your losses. For a memoryless distribution, there is no right time to quit. This may in part explain these games’ addictiveness.


In summary, knowing what distribution you're up against (i.e. having a correct prior knowledge) can make all the difference.

Small Data is Big Data in Disguise: Many behavior studies show that the predictions that people had made were extremely close to those produced by Bayes Rule. The reason we can often make good predictions from a small number of observations — or just a single one — is that our priors are so rich. Behavior studies show that we actually carry around in our heads surprisingly accurate priors. However, the challenge has increased with the development of the printing press, the nightly news, and social media. Those innovations allow us to spread information mechanically which may affect our priors. As sociologist Barry Glassner notes, the murder rate in the United States declined by 20% over the course of the 1990s, yet during that time period the presence of gun violence on American news increased by 600%. If you want to naturally make good predictions by following your intuitions, without having to think about what kind of prediction rule is appropriate, you need to protect your priors. Counter-intuitively, that might mean turning off the news.

Bayesian Estimation

There are three approaches to estimate a population parameter $\theta$ with independent observations $\mathcal{D}$: Maximum Likelihood Estimation (MLE), Maximum A Posteriori (MAP), and Bayesian Estimation (BE).

MLE

We find the value of $\theta$ that maximizes the likelihood

\begin{align*} \hat{\theta}_{\text{ML}} = \text{argmax}_{\theta} ~p(\mathcal{D}|\theta), \quad \text{where } p(\mathcal{D}|\theta) = \prod_{i=1}^{n} p(\mathbf{x}_i|\theta) \end{align*}

MLE for Gaussian:

Suppose that a data set of observations $\mathbf{x} = (x_1, \ldots, x_n)^\intercal$, representing $n$ observations of the scalar variable $x$, is drawn independently from a Gaussian distribution whose mean $\mu$ and variance $\sigma^2$ are unknown, and we would like to determine these parameters from the data set.

Because the data set $\mathbf{x}$ is i.i.d., the probability

\begin{align*} p(\mathbf{x}|\mu, \sigma^2) = \prod_{i=1}^{n} \mathcal{N} (x_i|\mu, \sigma^2) \end{align*}

gives the likelihood function and

\begin{align} \ln p(\mathbf{x}|\mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 - \frac{n}{2} \ln \sigma^2 - \frac{n}{2} \ln(2 \pi) \label{eqn_normal_ml} \end{align}

Maximizing \eqref{eqn_normal_ml} with respect to $\mu$, we obtain the maximum likelihood solution

\begin{align*} \mu_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i \end{align*}

which is the sample mean. Maximizing \eqref{eqn_normal_ml} with respect to $\sigma^2$, we obtain

\begin{align*} \sigma^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{ML})^2 \end{align*}

which is the sample variance measured with respect to the sample mean. Note that $\sigma^2_{ML}$ is a biased estimate of the true variance $\sigma^2$.
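
A quick simulation illustrates the bias (a sketch with arbitrary parameters): the expectation of $\sigma^2_{ML}$ is $\frac{n-1}{n} \sigma^2$ rather than $\sigma^2$.

# demonstrate that the ML variance estimate is biased: E[var_ml] = (n-1)/n * sigma^2
np.random.seed(0)
n, sigma2 = 5, 4.0
samples = np.random.normal(scale=np.sqrt(sigma2), size=(100000, n))
var_ml = np.mean((samples - samples.mean(axis=1, keepdims=True)) ** 2, axis=1)
print(var_ml.mean())         # close to (n-1)/n * sigma2
print((n - 1) / n * sigma2)  # = 3.2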

The plot shows the maximum likelihood estimation for the Gaussian.

x_train = stats.norm.rvs(scale=3., size=(100, 2))

mu_ml = np.mean(x_train, axis=0)
cov_ml = np.atleast_2d(np.cov(x_train.T, bias=True))

plt.figure(figsize=(5, 3))
x, y = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-10, 10, 100))
p = stats.multivariate_normal.pdf(np.array([x, y]).reshape(2, -1).T, mean=mu_ml, cov=cov_ml).reshape(100, 100)
plt.scatter(x_train[:, 0], x_train[:, 1], facecolor="none", edgecolor="steelblue")
plt.contour(x, y, p)
plt.show()

Mixtures of Gaussians:

Consider the sample shown in the figure below. We see that the data set forms two dominant clumps, and that a simple Gaussian distribution is unable to capture this structure, whereas a linear superposition of two Gaussians gives a better characterization of the data set.

By using a sufficient number of Gaussians, and by adjusting their means and covariances as well as the coefficients in the linear combination, almost any continuous density can be approximated to arbitrary accuracy.

We therefore consider a superposition of $K$ Gaussian densities of the form

\begin{align*} p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \mathcal{N} (\mathbf{x} | \mu_k, \Sigma_k) \end{align*}

where

\begin{align*} \sum_{k=1}^{K} \pi_k = 1 \text{ and } 0 \le \pi_k \le 1 \end{align*}

Due to the summation $\sum_{k=1}^{K}$, the maximum likelihood solution for the parameters

\begin{align*} \ln p(\mathbf{x_1}, \ldots, \mathbf{x_n}| \pi, \mu, \Sigma) = \sum_{i=1}^{n} \ln \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N} (\mathbf{x}_i | \mu_k, \Sigma_k) \right\} \end{align*}

no longer has a closed-form analytical solution.

n = 10000
np.random.seed(12345)
# Parameters of the mixture components
norm_params = np.array([[1, 1.3],
                        [9, 3]])
n_components = norm_params.shape[0]
# Weight of each component, in this case all of them are 1/2
weights = np.ones(n_components, dtype=np.float64) / 2.0
# A stream of indices from which to choose the component
mixture_idx = np.random.choice(len(weights), size=n, replace=True, p=weights)
# y is the mixture sample
y = np.fromiter((stats.norm.rvs(*(norm_params[i])) for i in mixture_idx), dtype=np.float64)

# Theoretical PDF plotting -- generate the x and y plotting positions
xs = np.linspace(y.min(), y.max(), 200)
ys = np.zeros_like(xs)

for (l, s), w in zip(norm_params, weights):
    ys += stats.norm.pdf(xs, loc=l, scale=s) * w
    
plt.figure(figsize=(6, 4))
plt.plot(xs, ys)
plt.hist(y, density=True, bins="fd")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()

The next plot fits the data with a mixture of 3 Gaussians using the sklearn package.

from sklearn import mixture

x1 = np.random.normal(size=(100, 2))
x1 += np.array([-5, -5])
x2 = np.random.normal(size=(100, 2))
x2 += np.array([5, -5])
x3 = np.random.normal(size=(100, 2))
x3 += np.array([0, 5])
X = np.vstack((x1, x2, x3))

x_test, y_test = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-10, 10, 100))
X_test = np.array([x_test, y_test]).reshape(2, -1).T

plt.figure(figsize=(6, 4))
plt.scatter(X[:, 0], X[:, 1], s=3)
Z = np.exp(mixture.GaussianMixture(n_components=3, covariance_type='full').fit(X).score_samples(X_test)).reshape(-1, 100)
plt.contour(x_test, y_test, Z)
plt.xlim(-8, 8)
plt.ylim(-8, 8)
plt.show()
MAP

We find the value of $\theta$ that maximizes the posterior

\begin{align*} \hat{\theta}_{\text{MAP}} = \text{argmax}_{\theta} ~p(\theta|\mathcal{D}) = \text{argmax}_{\theta} \frac{p(\mathcal{D}|\theta) p(\theta)}{p(\mathcal{D})} = \text{argmax}_{\theta} ~ p(\mathcal{D}|\theta) p(\theta) \end{align*}
BE

We first calculate the posterior distribution $p(\theta|\mathcal{D})$ fully, where

\begin{align*} p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta) p(\theta)}{p(\mathcal{D})}. \end{align*}

Then the estimate $\hat{\theta}$ can be obtained from the mean, $\textrm{E}_{\theta}(p(\theta|\mathcal{D}))$, or the mode of the distribution.


Consider a biased coin. Let $\theta$ be the probability of showing heads, i.e., $p(x=1|\theta) = \theta$, for the biased coin. If one toss comes up heads, we can estimate $\theta$ as follows:

MLE

We have \begin{align*} \hat{\theta} = \text{argmax}_{\theta} ~ p(x=1|\theta) = \text{argmax}_{\theta} ~ \theta = 1 \end{align*} We simply maximize the likelihood function (there is no need to use the log).

MAP

Suppose the prior $p(\theta) = 1$ is the uniform distribution on $[0,1]$: \begin{align*} p(\theta|x=1) &= \frac{p(x=1|\theta)p(\theta)}{p(x=1)} = \frac{p(x=1|\theta)}{p(x=1)} \\ \Rightarrow \hat{\theta} &= \text{argmax}_{\theta} ~ p(\theta|x=1) = \text{argmax}_{\theta} ~ p(x=1|\theta) = \text{argmax}_{\theta} ~ \theta = 1 \end{align*} We note that a uniform prior cannot resolve over-fitting.

Now, suppose the prior $p(\theta) = \text{Beta}(\theta|a,b)$. Because of conjugacy, the posterior $p(\theta|x=1) = \text{Beta}(\theta|a+1,b)$. \begin{align*} \hat{\theta} = \text{argmax}_{\theta} ~ p(\theta|x=1) = \text{argmax}_{\theta} \theta^a (1-\theta)^{b-1} = \frac{a}{a+b-1} \end{align*}

BE

Suppose $p(\theta) = 1$ is the uniform distribution on $[0,1]$. Note that \begin{align*} p(x=1) = \int_{0}^{1} p(x=1|\theta)p(\theta) d\theta = \int_{0}^{1} \theta d\theta = \frac{1}{2} \end{align*}

The posterior distribution is \begin{align*} p(\theta|x=1) &= \frac{p(x=1|\theta)p(\theta)}{p(x=1)} = \frac{p(x=1|\theta)}{p(x=1)} = 2 \theta \\ \hat{\theta} &= E(\theta|x=1) = \int_0^1 \theta p(\theta|x=1) d\theta = \int_0^1 2\theta^2 d\theta = \frac{2}{3} \end{align*}
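
A numerical check of the three estimates for a single observed head, using a grid over $\theta$ (a sketch; the grid size is arbitrary):

# compare MLE, MAP (uniform prior), and BE for one observed head
theta = np.linspace(0, 1, 100001)
likelihood = theta                   # p(x=1|theta) = theta
print(theta[np.argmax(likelihood)])  # MLE = 1
posterior = 2 * theta                # p(theta|x=1) under the uniform prior
print(theta[np.argmax(posterior)])   # MAP = 1
print(np.sum(theta * posterior) / np.sum(posterior))  # BE (posterior mean) = 2/3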


Both MLE and MAP return only a single, specific value for the parameter $\theta$.

  • MAP estimation "pulls" the estimate toward the prior.
  • The more focused (i.e., centered around mean with low variance) our prior belief, the larger the pull toward the prior.
  • The prior may have parameters, $p(\theta; \alpha)$, which play a "smoothing" role. For example, $\alpha$ is a hyperparameter that can be found by evidence approximation.

Bayesian estimation, by contrast, calculates fully the posterior distribution $p(\theta|\mathcal{D})$. Note that the probability of evidence $p(\mathcal{D})$ is a constant where \begin{align*} p(\mathcal{D}) = \int p(\mathcal{D}|\theta) p(\theta) d\theta. \end{align*}

$p(\mathcal{D})$ is ignored in MAP but is considered in Bayesian estimation. $\theta$ can have many possible estimated values following the estimated posterior distribution. For example,

  • We may choose $\textrm{E}(\theta)$ assuming $\textrm{Var}(\theta)$ is small.
  • We may declare no good estimate if $\textrm{Var}(\theta)$ is too large, since $\textrm{Var}(\theta)$ expresses confidence.

Adaptive Learning

Learning with Gaussian Prior

Consider a single Gaussian random variable $x$ with known variance $\sigma^2_x$ (or precision $\gamma_x = \sigma^{-2}_x$). Our task is to infer its mean $\mu$ given a set of $n$ observations $\{x_1, \ldots, x_n\}$, where the prior distribution is assumed to be Gaussian, $f(\mu) = \mathcal{N} (\mu | \mu_0, \sigma^2_0 = \gamma_0^{-1})$.

Iterative Form

After $n$ observations, we can learn that $\mu \sim \mathcal{N} (\mu | \mu_n, \sigma^2_n = \gamma_n^{-1})$ where \begin{align*} \mu_{n} &= \frac{\gamma_{n-1} \mu_{n-1} + \gamma_x x_{n}}{\gamma_{n-1} + \gamma_x} \\ \gamma_{n} &= \gamma_{n-1} + \gamma_x \end{align*}

Technical Details

The likelihood function is \begin{align*} p(\{x_1, \ldots, x_n\} | \mu) = \prod_{i=1}^{n} p(x_i | \mu) = \frac{1}{(2\pi \sigma^2_x)^{n/2}} \exp \left\{ -\frac{1}{2 \sigma^2_x} \sum_{i=1}^{n} (x_i - \mu)^2\right\} \end{align*}

The posterior distribution is given by \begin{align*} p(\mu|\{x_1, \ldots, x_n\}) \propto p(\{x_1, \ldots, x_n\} | \mu) p(\mu) \end{align*}

where $p(\mu) = \mathcal{N} (\mu | \mu_0, \sigma^2_0)$ is our prior. It gives

\begin{align*} p(\mu|\{x_1, \ldots, x_n\}) \propto \frac{1}{(2\pi \sigma^2_x)^{n/2}} \exp \left\{ -\frac{1}{2 \sigma^2_x} \sum_{i=1}^{n} (x_i - \mu)^2\right\} \frac{1}{(2\pi \sigma^2_0)^{1/2}} \exp \left\{ -\frac{(\mu - \mu_0)^2}{2 \sigma^2_0}\right\} \end{align*}

The exponent takes the form \begin{align} -\frac{1}{2 \sigma^2_x} \sum_{i=1}^{n} (x_i - \mu)^2 -\frac{(\mu - \mu_0)^2}{2 \sigma^2_0} = -\frac{\mu^2}{2} \left( \frac{1}{\sigma^2_0} + \frac{n}{\sigma^2_x} \right) + \mu \left( \frac{\mu_0}{\sigma^2_0} + \frac{1}{\sigma^2_x} \sum_{i=1}^{n} x_i \right) \tag{1.1} \end{align}

Conjugacy: We prefer to choose a prior such that the posterior distribution has the same functional form as the prior. This property is called conjugacy, and such a prior is called a conjugate prior. If the posterior has the same functional form as the prior, then Bayes' rule can easily be applied recursively.

Because of conjugacy, we have \begin{align*} p(\mu|\{x_1, \ldots, x_n\}) = \mathcal{N}(\mu_n, \sigma_n^2) \end{align*}

and the exponent of $\mathcal{N}(\mu_n, \sigma_n^2)$ takes the form \begin{align} -\frac{1}{2\sigma^2_n} (\mu - \mu_n)^2 = -\frac{\mu^2}{2 \sigma^2_n} + \frac{\mu \mu_n}{\sigma^2_n} + \text{constant} \tag{1.2} \end{align}

Since both (1.1) and (1.2) represent the posterior distribution, they must have the same coefficients for the $\mu^2$ and $\mu$ terms, respectively.

Compare \begin{align} -\frac{1}{2 \sigma^2_x} \sum_{i=1}^{n} (x_i - \mu)^2 -\frac{(\mu - \mu_0)^2}{2 \sigma^2_0} &= -\frac{\mu^2}{2} \left( \frac{1}{\sigma^2_0} + \frac{n}{\sigma^2_x} \right) + \mu \left( \frac{\mu_0}{\sigma^2_0} + \frac{1}{\sigma^2_x} \sum_{i=1}^{n} x_i \right) \nonumber \\ -\frac{1}{2\sigma^2_n} (\mu - \mu_n)^2 &= -\frac{\mu^2}{2 \sigma^2_n} + \frac{\mu \mu_n}{\sigma^2_n} + \text{constant} \nonumber \end{align}

We get \begin{align*} \frac{1}{\sigma_n^2} &= \frac{n}{\sigma^2_x} + \frac{1}{\sigma_0^2} \\ \frac{\mu_n}{\sigma^2_n} &= \frac{\mu_0}{\sigma^2_0} + \frac{1}{\sigma^2_x} \sum_{i=1}^{n} x_i. \end{align*}

After $n$ observations, we have learned that $\mu$ follows $\mathcal{N} (\mu | \mu_n, \sigma^2_n = \gamma_n^{-1})$ where

\begin{align*} &\left\{ \begin{aligned} \frac{1}{\sigma_n^2} &= \frac{n}{\sigma^2_x} + \frac{1}{\sigma_0^2} \\ \mu_n &= \left( \frac{n}{\sigma^2_x} + \frac{1}{\sigma_0^2} \right)^{-1} \left( \frac{\mu_0}{\sigma^2_0} + \frac{1}{\sigma^2_x} \sum_{i=1}^{n} x_i \right) \end{aligned}\right. \\ \Rightarrow &\left\{ \begin{aligned} \gamma_n &= n \gamma_x + \gamma_0 \\ \mu_n &= \frac{\mu_0 \gamma_0 + \gamma_x \sum_{i=1}^{n} x_i}{\gamma_n}. \end{aligned}\right. \end{align*}

The learning procedure is defined by the updating equation

\begin{align*} \mu_{n} &= \frac{\gamma_{n-1} \mu_{n-1} + \gamma_x x_{n}}{\gamma_{n-1} + \gamma_x} \\ \gamma_{n} &= \gamma_{n-1} + \gamma_x \end{align*}


The code shows the learning process.

  • The prior distribution is gaussian(0, 0.3)
  • The data set is generated from gaussian(0.8, 0.1)
  • After observing data sequentially, we can eventually learn the 'true' distribution gaussian(0.8, 0.1)

class gaussian:
    def __init__(self, mu, var):
        self.parameter = {}
        self.mu = mu
        self.var = var
        self.tau = 1/var
        
    @property
    def var(self):
        return self.parameter["var"]
    
    @property
    def tau(self):
        return self.parameter["tau"]
    
    @var.setter
    def var(self, var):
        self.parameter["var"] = var
        self.parameter["tau"] = 1 / var
    
    @tau.setter
    def tau(self, tau):
        self.parameter["tau"] = tau
        self.parameter["var"] = 1 / tau
        
mu = gaussian(0, 0.3)
data = gaussian(0.8, 0.1)

plt.figure(figsize=(5,4))
x = np.linspace(-1.2, 1.5, 100)
plt.plot(x, stats.norm.pdf(x, mu.mu, np.sqrt(mu.var)), label="n=0")

x_train = np.random.normal(loc=data.mu, scale=np.sqrt(data.var), size=10)

for beg, end in [[0,1],[1,2],[2,10]]:
    seq_x = x_train[beg:end]
    n = len(seq_x)
    mu_ml = np.mean(seq_x, 0)
    mu.mu = (data.var * mu.mu +  n * mu.var * mu_ml) / (data.var + n * mu.var)
    mu.tau = mu.tau + n * data.tau
    plt.plot(x, stats.norm.pdf(x, mu.mu, np.sqrt(mu.var)), label="n={}".format(end))

plt.xlim(-1.2, 1.5)
plt.ylim(0, 5)
plt.legend()
plt.show()

Learning with Unknown Variance

Consider a single Gaussian random variable $x$ whose mean $\mu$ and variance $\sigma^2_x$ (or precision $\gamma = \sigma^{-2}_x$) are both unknown. Our task is to infer $\mu$ and $\gamma$ after observing a set of $n$ observations $\{x_1, \ldots, x_n\}$; as derived in the technical details below, the conjugate prior is now a normal-gamma distribution.

Iterative Form
\begin{align*} \mu_n &= \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_0 + n} \\ \kappa_n &= \kappa_0 + n \\ \alpha_n &= \alpha_0 + n/2 \\ \beta_n &= \beta_0 + \frac{\kappa_0 n (\bar{x} - \mu_0)^2}{2(\kappa_0 + n)} + \frac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})^2 \end{align*}
Technical Details

The likelihood function is \begin{align*} p(\{x_1, \ldots, x_n\} | \mu, \gamma) = \prod_{i=1}^{n} p(x_i | \mu, \gamma) = \frac{(\gamma)^{n/2}}{(2\pi)^{n/2}} \exp \left\{ -\frac{\gamma}{2} \sum_{i=1}^{n} (x_i - \mu)^2\right\} \end{align*}

The conjugate prior is the normal-gamma distribution with density function as follows

\begin{align} f(\mu, \gamma| \mu_0, \kappa_0, \alpha_0, \beta_0) &= \frac{\beta_0^{\alpha_0} \sqrt{\kappa_0}}{\Gamma(\alpha_0) \sqrt{2\pi}} \gamma^{\alpha_0-1/2} \exp \left( -\beta_0 \gamma - \frac{\kappa_0 \gamma (\mu -\mu_0)^2}{2} \right) \nonumber \\ & \propto \gamma^{\alpha_0-1/2} \exp \left( -\gamma \beta_0 - \frac{\kappa_0 \gamma}{2} (\mu -\mu_0)^2 \right) \label{normal-gamma-form} \end{align}

where \begin{align*} \mathbb{E}(\mu) &= \mu_0 & \mathbb{E}(\gamma) &= \frac{\alpha_0}{\beta_0} \\ \mathbb{Var}(\mu) &= \frac{\beta_0}{\kappa_0(\alpha_0-1)} & \mathbb{Var}(\gamma) &= \frac{\alpha_0}{\beta^2_0} \end{align*}

Let $\bar{x} = 1/n \sum_{i=1}^{n} x_i$. We have the following two equations \begin{align} \sum_{i=1}^{n} (x_i - \mu)^2 &= n (\mu - \bar{x})^2 + \sum_{i=1}^{n} (x_i - \bar{x})^2 \label{eq001} \\ \kappa_0 (\mu - \mu_0)^2 + n (\mu - \bar{x})^2 &= (\kappa_0 + n) \left( \mu - \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_0 + n}\right)^2 + \frac{\kappa_0 n (\bar{x} - \mu_0)^2}{\kappa_0 + n} \label{eq002} \end{align}

Because of equations \eqref{eq001} and \eqref{eq002}, the posterior can be derived as follows

\begin{align*} p(\mu, \gamma|\{x_1,\ldots,x_n\}) &\propto p(\{x_1, \ldots, x_n\} | \mu, \gamma) f(\mu, \gamma| \mu_0, \kappa_0, \alpha_0, \beta_0) \\ &\propto \gamma^{\alpha_0 + n/2 - 1/2} \exp \left( -\frac{\gamma}{2} \sum_{i=1}^{n} (x_i - \mu)^2 -\beta_0 \gamma - \frac{\kappa_0 \gamma (\mu -\mu_0)^2}{2} \right) \\ &\propto \gamma^{\alpha_0 + n/2 - 1/2} \exp \left( - \gamma \left( \beta_0 + \frac{\kappa_0 n (\bar{x} - \mu_0)^2}{2(\kappa_0 + n)} + \frac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})^2 \right) - \frac{(\kappa_0 + n)\gamma}{2} \left( \mu - \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_0 + n}\right)^2 \right) \end{align*}

Comparing with the normal-gamma distribution (see \eqref{normal-gamma-form}), we have

\begin{align*} \mu_n &= \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_0 + n} \\ \kappa_n &= \kappa_0 + n \\ \alpha_n &= \alpha_0 + n/2 \\ \beta_n &= \beta_0 + \frac{\kappa_0 n (\bar{x} - \mu_0)^2}{2(\kappa_0 + n)} + \frac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})^2 \end{align*}

Note that because of conjugacy, it is equivalent to have the iterative form as follows

\begin{align*} \mu_{n+t} &= \frac{\kappa_t\mu_t + n\bar{x}}{\kappa_t + n} \\ \kappa_{n+t} &= \kappa_t + n \\ \alpha_{n+t} &= \alpha_t + n/2 \\ \beta_{n+t} &= \beta_t + \frac{\kappa_t n (\bar{x} - \mu_t)^2}{2(\kappa_t + n)} + \frac{1}{2}\sum_{i=1}^{n} (x_{t+i} - \bar{x})^2 \end{align*}

where $\bar{x} = 1/n \sum_{i=1}^{n} x_{t+i}$.
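
The update can be transcribed directly into code; a minimal sketch (the prior hyperparameters and the data-generating parameters are illustrative):

# normal-gamma posterior update for a Gaussian with unknown mean and precision
def normal_gamma_update(mu0, kappa0, alpha0, beta0, x):
    x = np.asarray(x)
    n, xbar = len(x), x.mean()
    mu_n = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2
    beta_n = (beta0 + kappa0 * n * (xbar - mu0) ** 2 / (2 * (kappa0 + n))
              + 0.5 * np.sum((x - xbar) ** 2))
    return mu_n, kappa_n, alpha_n, beta_n

np.random.seed(0)
x = np.random.normal(loc=0.8, scale=0.5, size=100)
mu_n, kappa_n, alpha_n, beta_n = normal_gamma_update(0.0, 1.0, 1.0, 1.0, x)
print(mu_n)              # posterior mean of mu, close to 0.8
print(alpha_n / beta_n)  # posterior mean of the precision, close to 1/0.25 = 4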


Learning Correlated Gaussian Prior

A particularly important problem class in learning involves multiple correlated alternatives. Correlated beliefs are a particularly powerful device in learning, allowing us to generalize the results of a single observation to other alternatives that we have not directly measured.

Consider a multivariate Gaussian random variable $\mathbf{x}$ with known covariance matrix $\Sigma$. Our task is to infer the mean vector $\mu$ given a single observation on the $i$th component $x_i$ with known variance $\sigma^2_i$.

For example, a physician is trying to treat diabetes using a treatment of three drugs, where she observes the drop in blood sugar from a course of a particular treatment. If one treatment produces a better-than-expected response, this would also increase our belief of the response from other treatments that have one or two drugs in common.

Suppose the prior is $p(\mu) = \mathcal{N}(\mu_{0}, \Sigma_{0})$. Then the posterior distribution is $p(\mu|x_i) = \mathcal{N}(\mu_{(1)}, \Sigma_{(1)})$ with updating equations

\begin{align*} \Sigma^{-1}_{(1)} & = \Sigma^{-1}_0 + \frac{1}{\sigma^2_i} e_i e_i^\intercal \\ \mu_{(1)} & = \Sigma_{(1)} \left( \Sigma^{-1}_0 \mu_0 + \frac{1}{\sigma^2_i} x_i e_i \right) \end{align*}

where the subscript $(1)$ indicates the first learning step and $e_i$ is the $i$th unit vector.
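
A sketch of one learning step in numpy, transcribing the updating equations above (the prior mean uses the $\pmb{\mu}$ values from the example below; the covariance matrix and observation are illustrative):

# one Bayesian update of a correlated Gaussian belief after observing component i
def update_correlated(mu0, Sigma0, i, x_i, sigma2_i):
    e_i = np.zeros(len(mu0))
    e_i[i] = 1.0
    prec0 = np.linalg.inv(Sigma0)
    Sigma1 = np.linalg.inv(prec0 + np.outer(e_i, e_i) / sigma2_i)
    mu1 = Sigma1 @ (prec0 @ mu0 + x_i * e_i / sigma2_i)
    return mu1, Sigma1

mu0 = np.array([20.0, 16.0, 22.0])    # illustrative prior mean
Sigma0 = np.array([[12.0, 6.0, 3.0],  # illustrative prior covariance
                   [6.0, 7.0, 4.0],
                   [3.0, 4.0, 15.0]])
mu1, Sigma1 = update_correlated(mu0, Sigma0, i=2, x_i=30.0, sigma2_i=9.0)
print(mu1)  # a good outcome on alternative 2 also raises beliefs about 0 and 1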

Let $\mathbf{a} = (a_1,\ldots,a_m)$ be the outcomes of $m$ alternatives with $\mathbf{a} \sim \mathcal{N}(\pmb{\mu}, \Sigma)$ (precision matrix $S \equiv \Sigma^{-1}$), where the outcomes of different alternatives are correlated. Our task is to estimate $\pmb{\mu}$.


Suppose the observations are $\{x_1,\ldots,x_n\}$. If the observation $x_i$ is the outcome of the $j$th alternative that we decide to measure, we denote by $\Sigma^{-1}_i$ the matrix of zeros except that $(\Sigma^{-1}_i)_{jj} = \beta_j$, where $\beta_j = \sigma^{-2}_j$ is the $j$th alternative's precision, and by $\mathbf{x}_i$ the zero vector except that $(\mathbf{x}_i)_j = x_i$.

Suppose $m=3$ and $$\pmb{\mu} = \left[\begin{matrix} 20 \\16 \\22 \end{matrix} \right]$$

\begin{align*} p(\pmb{\mu}|\{x_1, \ldots, x_n\}) &\propto p(\{x_1, \ldots, x_n\} | \pmb{\mu}) p(\pmb{\mu}) = \prod_{i=1}^{n} p(x_i | \pmb{\mu}) p(\pmb{\mu}) \end{align*}

so that

\begin{align*} \ln p(\pmb{\mu}|\{x_1, \ldots, x_n\}) &= -\frac{1}{2} \sum_{i=1}^{n} (\mathbf{x}_{i} - \pmb{\mu})^\intercal \Sigma^{-1}_i (\mathbf{x}_{i} - \pmb{\mu}) - \frac{1}{2} (\pmb{\mu} - \pmb{\mu}_0)^\intercal \Sigma^{-1}_0 (\pmb{\mu} - \pmb{\mu}_0) + \text{const} \\ &= -\frac{1}{2} \pmb{\mu}^\intercal \left( \sum_{i=1}^{n} \Sigma^{-1}_i + \Sigma^{-1}_0 \right) \pmb{\mu} + \pmb{\mu}^\intercal \left( \sum_{i=1}^{n} \Sigma^{-1}_i \mathbf{x}_i + \Sigma^{-1}_0 \pmb{\mu}_0 \right) + \text{const} \end{align*}

Completing the square shows that the posterior is Gaussian, $\mathcal{N}\left(\pmb{\mu}^{(n)}, \Sigma^{(n)}\right)$, with precision $S^{(n)} \equiv (\Sigma^{(n)})^{-1} = \sum_{i=1}^n \Sigma^{-1}_i + \Sigma^{-1}_0$ and mean $\pmb{\mu}^{(n)} = \Sigma^{(n)} \left( \Sigma^{-1}_0 \pmb{\mu}_0 + \sum_{i=1}^n \Sigma^{-1}_i \mathbf{x}_i \right)$. Writing the mean recursively and applying the Sherman-Morrison formula to $\Sigma^{(n)} = \left( S^{(n-1)} + \sigma^{-2}_j e_j e^\intercal_j \right)^{-1}$ gives

\begin{align*} \pmb{\mu}^{(n)} &= \Sigma^{(n)} \left( S^{(n-1)}\pmb{\mu}^{(n-1)} + \Sigma^{-1}_n \mathbf{x}_n \right) \\ &= \left( \Sigma^{(n-1)} - \frac{\Sigma^{(n-1)} e_j e^\intercal_j\Sigma^{(n-1)}}{\sigma^{2}_j + e^\intercal_j \Sigma^{(n-1)} e_j} \right) \left( S^{(n-1)}\pmb{\mu}^{(n-1)} + \Sigma^{-1}_n \mathbf{x}_n \right) \\ &= \pmb{\mu}^{(n-1)} + \frac{x_n - \pmb{\mu}^{(n-1)}_j}{\sigma^{2}_j + e^\intercal_j \Sigma^{(n-1)} e_j} \Sigma^{(n-1)} e_j \end{align*}


The Beta-Bernoulli Model

Our objective is to learn the probability that a certain event will occur. For example, in a medical setting, our observations might simply be whether or not a certain medical treatment is successful. Such an observation can be modeled as a Bernoulli random variable, which is equal to 1 (success) with probability $\theta$, and 0 (failure) with probability $1 - \theta$. The success probability $\theta$ is the unknown true value in this case.

To estimate $\theta$, the likelihood function $\text{Bern}(k|\theta)$ takes the form of the product of factors of the form $\theta^k(1-\theta)^{1-k}$ where $k \in \{0,1\}$ is the binary observation. If we choose a prior to be proportional to powers of $\theta$ and $(1-\theta)$, then the posterior distribution, which is proportional to the product of the prior and the likelihood function, will have the same functional form as the prior.

Let the prior be $\text{Beta}(\theta | a, b)$ with

\begin{align*} \text{Beta}(\theta | a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \theta^{a-1} (1-\theta)^{b-1} \end{align*}

Recall that it has mean $a/(a+b)$ and variance $ab/((a+b)^2(a+b+1))$.

Because

\begin{align*} \text{Bern}(k|\theta) \times \text{Beta}(\theta | a, b) & \propto \theta^k(1-\theta)^{1-k} \times \theta^{a-1}(1-\theta)^{b-1} \propto \theta^{(a+k)-1}(1-\theta)^{(b+(1-k))-1} \\ & \propto \text{Beta}(\theta | a+k, b+1-k) \end{align*}

The posterior distribution of $\theta$ is

\begin{align*} \underbrace{\text{Beta}(\theta | a+k, b+1-k)}_{posterior} \propto \underbrace{\text{Bern}(k|\theta)}_{likelihood} \times \underbrace{\text{Beta}(\theta | a, b)}_{prior} \end{align*}

For a single observation of the success (with $k=1$) or failure (with $k=0$), the equation shows

\begin{align*} E(\theta) &= \frac{a+k}{(a+k)+(b+1-k)} = \frac{a+k}{a+b+1} \end{align*}

our updated estimate of the probability $\theta$.

It is easy to note that, after observing a data set of $m$ successes and $l$ failures, the posterior distribution of $\theta$ is $\text{Beta}(\theta | m+a, l+b)$. So,

\begin{align*} E(\theta) &= \frac{m+a}{m+a+l+b} \\ \text{Var}(\theta) &= \frac{(m+a)(l+b)}{(m+a+l+b)^2(m+a+l+b+1)} \end{align*}
x = np.linspace(0, 1, 100)

# generate 10,000 mu values from the prior distribution
prior = stats.beta.rvs(2, 2, size=10000)

# the likelihood follows bernoulli distribution with a given mu
likelihood = stats.bernoulli.rvs(prior, size=prior.shape)

posterior = (likelihood*prior)[(likelihood*prior)!=0]

fig, axes = plt.subplots(2, 2, figsize=(6,6))
axes[0,0].hist(prior, density=True, histtype='stepfilled', alpha=0.7, bins=50)
axes[0,1].plot(x, (stats.beta(2,2)).pdf(x))
axes[0,1].annotate("a={}".format(2), (0.05, 1.4))
axes[0,1].annotate("b={}".format(2), (0.05, 1.25))

axes[1,0].hist(posterior, density=True, histtype='stepfilled', alpha=0.7, bins=50)
axes[1,1].plot(x, (stats.beta(3,2)).pdf(x))
axes[1,1].annotate("a={}".format(3), (0.05, 1.7))
axes[1,1].annotate("b={}".format(2), (0.05, 1.55))
plt.show()
print("Based on simulation, posterior beta distribution has [a,b]={}"\
      .format(np.round(stats.beta.fit(posterior, floc=0, fscale=1)[:2],3)))
Based on simulation, posterior beta distribution has [a,b]=[3.031 2.005]

Consider a problem where we learn as we go.

Example: Finding the Best Hitter

The only way to identify the best hitter on a baseball team is to put the hitter into the lineup. The performance of 3 candidates from previous games is given in the table.

Player   No. hits   No. at-bats   Average
A        36         100           0.360
B        1          3             0.333
C        7          22            0.318

Based on this information, you need to choose hitters (the same player twice or two different players) for two at-bats. For example, you can choose player B for both at-bats, or choose player A for the first at-bat and B for the second. For each at-bat, you score 1 if the hitter gets a hit, and 0 otherwise.

Which player to choose for the first at-bat?

Answer

The answer is player B.

The decision tree is drawn as follows:

The decision flow is as follows:

  • Choose a player (A, B, or C) for the first at-bat. They may get a hit or not, with the probability given in the table.
  • Choose a player (A, B, or C) for the second at-bat (the player labels A, B, and C are omitted in the decision tree).

The numbers (at the right end of the decision tree) show the expected value obtained if a player is chosen. For example,

  • The first number 0.366 in red is the updated probability for player A after A is chosen for the first at-bat and gets a hit: 0.366 = 37 / 101, because player A now has 37 hits in 101 at-bats (previously 36 hits in 100 at-bats in the table).
  • The second number 0.333 is the same probability for player B as in the table. It stays the same because player B is not chosen for the first at-bat in this scenario.

Following the Beta-Bernoulli model,

  • Let $H^n$ be the number of hits a player has made in $n$ at-bats (e.g. $H^{100} = 36$ for player A)
  • Let $h_{n+1} = 1$ if a hitter gets a hit in his/her $(n+1)$-st at-bat, and $h_{n+1} = 0$ otherwise

The prior is $$\mathbb{P}\left[h_{n+1} = 1| H^n, n\right] = \frac{H^n}{n}$$ which is $36/100$ for player A.

Once we observe $h_{n+1}$, it is possible to show that the posterior is $$\mathbb{P}\left[h_{n+2} = 1| H^n, n, h_{n+1} \right] = \frac{H^n+h_{n+1}}{n+1}$$ Therefore, player A's posterior is $0.366 = 37/101$ after A is chosen for the first at-bat and gets a hit.

In summary, all the values in red are the updated probabilities of the chosen players after their first at-bat. These probabilities help us determine which hitter to choose for the second at-bat (also the last one, as we consider only two at-bats). For example:

  • if we choose A for the first at-bat and A gets a hit, then we should choose A for the second at-bat because $0.366 = \max \{0.366, 0.333, 0.318\}$ (the first group in the decision tree). The expected score is 0.366$\times$1 = 0.366.

  • if we choose C for the first at-bat and C gets a hit, then we should choose A for the second at-bat because $0.360=\max\{0.360, 0.333, 0.348\}$ (the second-to-last group in the decision tree). The expected score is 0.360$\times$1 = 0.360.

Then, we have the decision tree with the second at-bat decision replaced by the best expected value. The tree below shows the expected value of the best hitter after one at-bat.

The tree below shows the expected value of each hitter before the first at-bat.

Although player A initially has the highest batting average, our analysis says that we should try player B for the first at-bat. Because player B has had only three at-bats, his estimate carries the greatest uncertainty, and it can be valuable to collect information about choices where uncertainty is greatest.
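Here is a minimal sketch of the two-at-bat computation (the helper names avg and expected_score are ours); it reproduces the expected values at the roots of the trees:

players = {"A": (36, 100), "B": (1, 3), "C": (7, 22)}

def avg(hits, at_bats):
    return hits / at_bats

def expected_score(first):
    """Expected total score when `first` bats first and the second
    at-bat goes to the hitter with the best updated average."""
    h, n = players[first]
    p = avg(h, n)
    others = [avg(*v) for k, v in players.items() if k != first]
    best_if_hit = max(avg(h + 1, n + 1), *others)
    best_if_miss = max(avg(h, n + 1), *others)
    # score 1 for a first-at-bat hit, plus the best expected second at-bat
    return p * (1 + best_if_hit) + (1 - p) * best_if_miss

for name in players:
    print(name, round(expected_score(name), 3))
# A 0.72, B 0.74, C 0.678  -> player B is the best first choice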


Multivariate-Dirichlet Model

The Beta-Bernoulli model can be generalized to a multivariate setting. Suppose that, instead of taking a simple 0/1 value, each observation can be classified as belonging to one of $K > 2$ different categories. For example, instead of merely measuring the success or failure of a medical treatment, we may also distinguish cases where the treatment is generally effective but causes certain side effects.

We want to learn the probabilities of $K$ alternatives $\theta = (\theta_1, \ldots, \theta_K)$, where $\theta_k$ is the probability of the $k$th alternative, based on $n$ observations $\mathbf{x}^1, \ldots, \mathbf{x}^n$. The observations are represented in a 1-of-$K$ scheme, so that

\begin{align*} \mathbf{x}^i = (0, \ldots, 0, \underbrace{1}_{k \text{th}}, 0, \ldots, 0)_{1 \times K} \end{align*}

indicates that the $i$th observation is the $k$th alternative. Therefore, the distribution of an observation $\mathbf{x}$ is given by

\begin{align*} p(\mathbf{x}|\theta) = \prod_{k=1}^{K} \theta_k^{x_k} \end{align*}

Then, the likelihood function takes the form

\begin{align*} p(\mathbf{x}^1, \ldots, \mathbf{x}^n|\theta) = \prod_{i=1}^{n} \prod_{k=1}^{K} \theta_k^{\mathbf{x}^i_k} = \prod_{k=1}^{K} \theta_k^{\sum_{i=1}^n x^i_k} \end{align*}

The conjugate prior is given by the Dirichlet distribution

\begin{align*} \text{Dir}(\theta| \alpha^{(0)}_0,\ldots,\alpha^{(0)}_K) = \frac{\Gamma(\alpha^{(0)}_0)}{\Gamma(\alpha^{(0)}_1) \cdots \Gamma(\alpha^{(0)}_K)} \prod_{k=1}^{K} \theta^{\alpha^{(0)}_k-1}_k \end{align*}

with $\alpha^{(0)}_0 = \sum_{i=1}^{K} \alpha^{(0)}_i$. The posterior distribution is in the form

\begin{align*} p(\mathbf{x}^1, \ldots, \mathbf{x}^n|\theta) \text{Dir}(\theta| \alpha^{(0)}_0,\ldots,\alpha^{(0)}_K) & \propto \text{Dir}\left(\theta| \alpha^{(0)}_0+n,\alpha^{(0)}_1+\sum_{i=1}^n \mathbf{x}^i_1, \ldots,\alpha^{(0)}_K+\sum_{i=1}^n \mathbf{x}^i_K\right) \\ & \equiv \text{Dir}\left(\theta| \alpha^{(n)}_0,\alpha^{(n)}_1, \ldots,\alpha^{(n)}_K \right) \end{align*}

Our estimate of the probability of observing an outcome in category $k$ is given by

\begin{align*} \theta^{(n)}_k = \frac{\alpha^{(n)}_k}{\alpha^{(n)}_1 + \cdots + \alpha^{(n)}_K} = \frac{\alpha^{(0)}_k+\sum_{i=1}^n \mathbf{x}^i_k}{\alpha^{(0)}_0+n} \end{align*}

Our belief is updated as follows

\begin{align*} \alpha^{(n)}_k = \left\{ \begin{aligned} &\alpha^{(n-1)}_k + 1 && \text{if observation belongs to category } k,\\ &\alpha^{(n-1)}_k && \text{otherwise.} \end{aligned} \right. \end{align*}
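As a quick numeric sketch (the prior parameters and counts here are our own illustrative choices), the posterior update simply adds the category counts to the prior parameters:

alpha0 = np.array([1.0, 1.0, 1.0])      # symmetric Dirichlet prior, K = 3
counts = np.array([5, 2, 3])            # observed category counts from n = 10 draws

alpha_n = alpha0 + counts               # posterior Dirichlet parameters
theta_hat = alpha_n / alpha_n.sum()     # posterior mean estimate of each theta_k
print(alpha_n, np.round(theta_hat, 3))  # [6. 3. 4.] [0.462 0.231 0.308]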

The Curse of Dimensionality

The severe difficulty that can arise with data in many dimensions is sometimes called the curse of dimensionality. We can illustrate the difficulty as follows:

Illustration 1

Consider inputs uniformly distributed in a $p$-dimensional unit hypercube. We perform the nearest-neighbor procedure to estimate $f(x_0) = \text{Avg}_{x_i \in V} x_i$, where $V$ is a hypercubical neighborhood about the target point $x_0$ and captures a fraction $r$ of the observations.

The expected edge length of $V$ is $r^{1/p}$, since it includes a fraction $r$ of the unit volume. Note that, in ten dimensions ($p=10$), to capture 1% or 10% of the data to form a local average, we must cover 63% ($=0.01^{0.1}$) or 80% ($=0.1^{0.1}$) of the range of each input variable.

Clearly, such neighborhoods are no longer "local". Nor can we reduce $r$ dramatically, since the fewer observations we average, the higher the variance of our fit.
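A few lines of arithmetic make the point; the values below are simply $r^{1/p}$:

# expected edge length r**(1/p) of a neighborhood capturing a fraction r
for p in [1, 2, 10]:
    for r in [0.01, 0.1]:
        print(p, r, round(r**(1/p), 2))
# at p=10: r=0.01 -> 0.63 and r=0.1 -> 0.79 (about 80%) of each axis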

Illustration 2

Consider $n$ data points uniformly distributed in a $p$-dimensional unit ball centered at $x_0$, and consider a nearest-neighbor estimate at $x_0$. The median distance from $x_0$ to the closest data point is

\begin{align*} d = \left(1 - \left(\frac{1}{2}\right)^{1/n} \right)^{1/p}. \end{align*}

$d$ approaches 1 as the dimensionality $p$ increases.

Proof of the median distance: The volume of a ball of radius $r$ in $\Re^p$ is $\omega_p r^p$, where $\omega_p$ is a constant depending only on $p$. The probability that a point, taken uniformly in the unit ball, is within distance $r$ of the origin is the volume of that ball divided by the volume of the unit ball. So the CDF is

\begin{align*} F(r) = r^p, \; \; \; 0 \leq r \leq 1. \end{align*}

The PDF of the smallest order statistic (the minimum) $y$ of $n$ points is

\begin{align*} g(y) = n \left( 1 - F(y) \right)^{n-1} f(y) = n \left( 1 - y^p \right)^{n-1} p y^{p-1} \end{align*}

and its CDF is

\begin{align*} G(y) = 1 - \left( 1 - y^p \right)^{n} \end{align*}

The median $d$ satisfies $G(d) = 1/2$, i.e., $1 - (1-d^p)^n = 1/2$, which gives the formula $d = \left(1 - \left(\frac{1}{2}\right)^{1/n}\right)^{1/p}$ above.
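A quick simulation (a sketch with arbitrary choices of $n$, $p$, and number of trials) confirms the formula by comparing the empirical median minimum distance with $d$:

rng = np.random.default_rng(0)

def min_dist_in_ball(n, p):
    """Minimum distance to the origin among n uniform points in the unit p-ball."""
    # for a uniform point in the p-ball the radius has CDF r**p, so r = U**(1/p)
    radii = rng.uniform(size=n) ** (1 / p)
    return radii.min()

n, p = 500, 10
sim = np.median([min_dist_in_ball(n, p) for _ in range(2000)])
formula = (1 - 0.5 ** (1 / n)) ** (1 / p)
print(round(sim, 3), round(formula, 3))  # both close to 0.52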

Illustration 3

Let $V_p(r)$ be the volume of a $p$-dimensional ball of radius $r$. Another manifestation of the curse is that

\begin{align*} \frac{V_p(1)-V_p(1-\epsilon)}{V_p(1)} = 1 - (1-\epsilon)^p \end{align*}

For large $p$, the value tends to 1 even for small values of $\epsilon$. Thus, in spaces of high dimensionality, most of the volume of a sphere is concentrated in a thin shell near the surface.
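The shell fraction $1-(1-\epsilon)^p$ is easy to tabulate:

# fraction of the unit ball's volume within distance epsilon of the surface
for p in [2, 10, 100]:
    for eps in [0.01, 0.1]:
        print(p, eps, round(1 - (1 - eps)**p, 3))
# at p=100, even eps=0.01 puts about 63% of the volume in the shell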

The results indicate that, under the sparse sampling of high dimensions, all sample points are close to an edge of the sample. This presents a problem because prediction is much more difficult near the edges of the training sample: one must extrapolate from neighboring sample points rather than interpolate between them.


However, the curse of dimensionality does not prevent us from finding effective techniques applicable to high-dimensional data.

  • Real data will often be confined to a region of the space having lower effective dimensionality; in particular, the directions over which important variations in the target variables occur may be so confined.

  • Real data will typically exhibit some smoothness properties (at least locally), so that for the most part small changes in the input variables produce small changes in the target variables. We can therefore exploit local interpolation-like techniques to predict the target variables for new values of the input variables.

Information Theory

Information can be considered as assigning probabilities to a set of different events. Hence, the measure of information content depends on a distribution, say $q(x)$. The knowledge that "a pill normally takes effect after 3 hours" provides the following information content:

  1. the time that a pill takes effect for different people follows a certain distribution;

  2. the mean of such a distribution is 3 hours.

Generally speaking, we expect that a machine learning algorithm returns a distribution $p(x)$ which is "close" to $q(x)$. The question is how to measure the closeness of two distributions.

Entropy

Given a random state variable $x$, our measure of information content will depend on the probability distribution $p(x)$, so we look for a quantity $h(x)$ that is a function of the probability $p(x)$ and expresses the information content.

An important quantity is the entropy of the random variable $x$:

\begin{align*} H[x] = - \sum_{x} p(x) \log_2 p(x) \end{align*}

where we sum over all possible states of $x$. Entropy is a measure of the average amount of information in terms of the number of bits.

For example, the encoding scheme

Letter | Encoding
A      | 0
B      | 1
C      | 01

is short but extremely flawed, as the message "AB" and the message "C" are indistinguishable (both would be transmitted as 01). In contrast, the encoding scheme

Letter | Encoding
A      | 00001
B      | 00111
C      | 11111

is correct, but extremely inefficient, because many more bits than necessary are used to transmit messages reliably.

It is natural to ask what the "best" encoding scheme is. Although the exact answer depends strongly on the application, one natural answer is "the correct encoding scheme that minimizes the expected length of a message." That theoretical minimum is given by the entropy of the message.

For example, suppose a message details the value of a random variable $X$, defined by

x | Pr(X = x)
A | 0.5
B | 0.25
C | 0.125
D | 0.125

The entropy of this message is

\begin{align*} H[x] = - \frac{1}{2} \log_2 \frac{1}{2} - \frac{1}{4} \log_2 \frac{1}{4} - \frac{1}{8} \log_2 \frac{1}{8} - \frac{1}{8} \log_2 \frac{1}{8} = 1.75 \text{ bits} \end{align*}

so the expected length of any correct encoding scheme is at least 1.75 bits. In this case, the bound can be achieved with a relatively simple scheme:

Letter | Encoding | Length
A      | 1        | 1
B      | 01       | 2
C      | 001      | 3
D      | 000      | 3

It can be verified that the above encoding is correct (it is an example of Huffman coding), and it has expected length

\begin{align*} 1 \cdot 0.5 + 2 \cdot 0.25 + 3 \cdot 0.125 + 3 \cdot 0.125 = 1.75, \end{align*}

which is equal to the entropy of the message.
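The arithmetic is easy to check in code, using the distribution and code lengths from the tables above:

probs = np.array([0.5, 0.25, 0.125, 0.125])   # Pr(X = A, B, C, D)
lengths = np.array([1, 2, 3, 3])              # Huffman code lengths above

entropy = -(probs * np.log2(probs)).sum()     # H[x] in bits
expected_length = (probs * lengths).sum()
print(entropy, expected_length)               # 1.75 1.75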

This relation between entropy and shortest coding length is a general one. The noiseless coding theorem (Shannon, 1948) states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.

The uniform distribution has a larger entropy than a nonuniform one. Consider a random variable $x$ having 4 possible states, each equally likely. To communicate the value of $x$ to a receiver, we would need to transmit a message of length 2 bits, since

\begin{align*} H[x] = - 4 \times \frac{1}{4} \log_2 \frac{1}{4} = 2 \text{ bits} \end{align*}

which is greater than the 1.75 bits obtained above from a nonuniform distribution.

The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge is the one with largest entropy.

Now we shall switch to the use of natural logarithms in defining entropy, as this is more convenient. The (differential) entropy is

\begin{align*} H[x] = - \int p(x) \ln p(x) d x \end{align*}

What is the maximum entropy distribution:

over a finite range $[a,b]$

It is the uniform distribution.

Proof: Introducing a Lagrange multiplier for the normalization constraint turns this into an unconstrained problem with Lagrangian

\begin{align*} \mathcal{L} (p(x), \lambda) = -\int_a^b p(x) \ln p(x) d x + \lambda \left(\int_a^b p(x) d x - 1 \right) \end{align*}

where (by the calculus of variations)

\begin{align*} \frac{\partial \mathcal{L} (p(x), \lambda)}{\partial p(x)} = -1 - \ln p(x) + \lambda = 0 \Rightarrow p(x) = \exp(- 1+\lambda), \end{align*}

and $\lambda$ can be determined by

\begin{align*} \int_a^b p(x) d x = 1. \end{align*}

Since $p(x)$ is constant in $x$, normalization gives $p(x) = 1/(b-a)$: the uniform distribution on $[a,b]$.

given mean $\mu$

It is the exponential distribution.

Proof: The Lagrangian is

\begin{align*} \mathcal{L} (p(x), \lambda) = -\int_\Omega p(x) \ln p(x) d x + \lambda \left(\int_\Omega p(x) d x - 1 \right) + \alpha \left(\int_\Omega x p(x) d x - \mu \right) \end{align*}

where

\begin{align*} \frac{\partial \mathcal{L} (p(x), \lambda)}{\partial p(x)} = -1 - \ln p(x) + \lambda + \alpha x = 0 \Rightarrow p(x) = \exp(-1+\lambda+\alpha x), \end{align*}

with

\begin{align*} \int_\Omega p(x) d x = 1 \text{ and } \int_\Omega x p(x) d x = \mu \end{align*}

So,

\begin{align*} p(x) = \frac{1}{\mu} \exp \left( - \frac{x}{\mu} \right) \end{align*}

if $\Omega = [0, \infty)$. There is no solution if $\Omega = (-\infty, \infty)$.

given mean $\mu$ and variance $\sigma^2$

It is the Gaussian distribution.

Proof: The Lagrangian is

\begin{align*} \mathcal{L} (p(x), \lambda) = & -\int_\Omega p(x) \ln p(x) d x + \lambda \left(\int_\Omega p(x) d x - 1 \right) \\ & + \alpha \left(\int_\Omega x p(x) d x - \mu \right) + \beta \left(\int_\Omega (x-\mu)^2 p(x) d x - \sigma^2 \right) \end{align*}

where

\begin{align*} & \frac{\partial \mathcal{L} (p(x), \lambda)}{\partial p(x)} = -1 - \ln p(x) + \lambda + \alpha x + \beta (x-\mu)^2 = 0 \\ \Rightarrow \quad & p(x) = \exp(-1+\lambda+\alpha x + \beta (x-\mu)^2), \end{align*}

with

\begin{align*} \int_\Omega p(x) d x = 1, \int_\Omega x p(x) d x = \mu \text{ and } \int_\Omega (x-\mu)^2 p(x) d x = \sigma^2. \end{align*}

Solving these equations identifies $p(x)$ as the Gaussian density $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$.
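As a numeric sanity check (the comparison distributions are our own choices), scipy's entropy() method for continuous distributions returns differential entropy, so we can compare candidates satisfying the same constraints:

mu, sigma = 2.0, 1.0

# mean constraint on [0, inf): exponential vs. a gamma with the same mean
print(stats.expon(scale=mu).entropy(),            # 1 + ln(mu), about 1.693
      stats.gamma(a=2, scale=mu/2).entropy())     # smaller, about 1.577

# mean and variance constraints: Gaussian vs. a uniform with matching moments
half_width = np.sqrt(3) * sigma                   # uniform on mu +/- sqrt(3)*sigma has variance sigma**2
print(stats.norm(mu, sigma).entropy(),            # 0.5*ln(2*pi*e*sigma**2), about 1.419
      stats.uniform(mu - half_width, 2*half_width).entropy())  # ln(2*sqrt(3)*sigma), about 1.242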


Relative Entropy

Relative entropy is also called the Kullback-Leibler divergence:

\begin{align*} \text{KL}(p||q) = - \int p(x) \ln \frac{q(x)}{p(x)} d x \end{align*}

We have $\text{KL}(p||q) \ge 0$ with equality if and only if $p(x) = q(x)$.

Thus, we can interpret the Kullback-Leibler divergence as a measure of the dissimilarity of the two distributions $p(x)$ and $q(x)$.
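For discrete distributions, scipy.stats.entropy(p, q) computes exactly this quantity (in nats), which makes the nonnegativity easy to check:

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

print(stats.entropy(p, q))   # KL(p||q) > 0, about 0.173 nats
print(stats.entropy(p, p))   # KL(p||p) = 0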

Suppose that data is being generated from an unknown distribution $p(x)$ that we wish to model. We can try to approximate this distribution using some parametric distribution $q(x|\theta)$, governed by a set of adjustable parameters $\theta$.

With a finite set of training points $x_1, \ldots, x_n$, we have

\begin{align*} \text{KL}(p||q) = - \int p(x) \ln \frac{q(x|\theta)}{p(x)} d x = - E\left( \ln \frac{q(x|\theta)}{p(x)} \right) \approx \frac{1}{n} \sum_{i=1}^{n} (- \ln q(x_i|\theta) + \ln p(x_i)) \end{align*}

Therefore, minimizing $\text{KL}(p||q)$ is equivalent to maximizing

\begin{align*} \sum_{i=1}^{n} \ln q(x_i|\theta) \end{align*}

which is the log-likelihood function.
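The sketch below illustrates this equivalence on assumed data: samples are drawn from a Gaussian $p(x)$, the model $q(x|\theta)$ is a Gaussian with $\theta = (\mu, \sigma)$, and a crude grid search over the log-likelihood recovers parameters close to the truth:

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=0.5, size=5000)   # samples from the "unknown" p(x)

# log-likelihood of a Gaussian model q(x|mu, sigma) on the data
def log_likelihood(mu, sigma):
    return stats.norm(mu, sigma).logpdf(data).sum()

# maximizing the log-likelihood over a grid approximates minimizing KL(p||q)
mus = np.linspace(1.5, 2.5, 51)
sigmas = np.linspace(0.3, 0.8, 51)
ll = np.array([[log_likelihood(m, s) for s in sigmas] for m in mus])
i, j = np.unravel_index(ll.argmax(), ll.shape)
print(round(mus[i], 2), round(sigmas[j], 2))       # close to the true (2.0, 0.5)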