chtangwin

Dive into Stats/FinMath

Think Bayes

Notes from reading the online book Think Bayes: Bayesian Statistics Made Simple.

Chapter 5 Odds and Addends

In addition to normal Bayesian formula

$$ p(H|D) = \frac{p(D|H)p(H)}{p(D)} $$

The odds form of Bayes's theorem is

$$ o(A|D) = o(A) \frac{p(D|A)}{p(D|B)} $$

i.e., the posterior odds are the prior odds times the likelihood ratio. Sometimes, this form is most convenient for computing a Bayesian update on paper or in your head.

Chapter 6 Decision Analysis

The results from Bayesian analysis comes in the form of a posterior distribution. Classical method usually generates a single point estimate, e.g. maximum likelihood or MAP estimates.

Bayesian methods are most useful when you can carry the posterior distribution into the next step of the analysis to perform some kind of decision analysis, e.g. to compute an optimal bid in this chapter, or some kind of prediction seen in the next chapter.

Chapter 7 Prediction

The Boston Bruins Problem

In the 2010-11 National Hockey League (NHL) Finals, my beloved Boston Bruins played a best-of-seven championship series against the despised Vancouver Canucks. Boston lost the first two games 0-1 and 2-3, then won the next two games 8-1 and 4-0. At this point in the series, what is the probability that Boston will win the next game, and what is their probability of winning the championship?

The following assumptions are used:

  1. goal scoring in hockey is at least approximately a Poisson process
  2. each team has some long-term average goals per game, denoted λ

Given these assumptions, the strategy for answering this question is

  1. Use statistics from previous games to choose a prior distribution for λ. From http://www.nhl.com, it is roughly Gaussian, i.e. $N(\mu, \sigma^2)$ with $\mu=2.7, \sigma=0.3$.

  2. Use the score from the first four games to estimate λ for each team.

  3. Use the posterior distributions of λ to compute distribution of goals for each team, the distribution of the goal differential, and the probability that each team wins the next game.

  4. Compute the probability that each team wins the series.

The following code shows the steps using pyMC library.

In [7]:
%matplotlib inline

import pymc as pm
import numpy as np
from IPython.core.pylabtools import figsize
from matplotlib import pyplot as plt
In [2]:
names = ['buins', 'canucks']
scores = [np.array([0,2,8,4]), np.array([1,3,1,0])]
N = len(scores)

lam   = [pm.Normal('lambda_%s' % names[i], mu=2.7, tau=1/0.3**2) for i in range(N)]
games = [pm.Poisson('games_%s' % names[i], mu=lam[i], value=scores[i], observed=True) for i in range(N)]
predicts = [pm.Poisson('predict_%s' % names[i], mu=lam[i]) for i in range(N)]
times = [pm.Exponential('time_%s' % names[i], beta=lam[i]) for i in range(N)]

mdl = pm.MCMC([lam, games, predicts, times])
mdl.sample(iter=20000, burn=10000)
 [-----------------100%-----------------] 20000 of 20000 complete in 5.9 sec
In [3]:
from pymc.Matplot import plot
plot(mdl.trace('lambda_buins'))
plot(mdl.trace('lambda_canucks'))
Plotting lambda_buins
Plotting lambda_canucks

In [4]:
buins_samples = mdl.trace('lambda_buins')[:]
canucks_samples = mdl.trace('lambda_canucks')[:]
In [5]:
figsize(12.5, 4)
plt.hist(buins_samples, histtype='stepfilled', bins=50, alpha=0.85,
         label="posterior of $\lambda$_buins", color="#A60628", normed=True)
plt.hist(canucks_samples, histtype='stepfilled', bins=50, alpha=0.85,
         label="posterior of $\lambda$_canucks", color="#467821", normed=True)

mb = buins_samples.mean()
mc = canucks_samples.mean()
plt.vlines(mb, 0, 1.5, lw=2, colors=u'm', label="$\lambda$_buins mean = %.2f" % mb)
plt.vlines(mc, 0, 1.5, lw=2, colors=u'k', label="$\lambda$_canucks mean = % .2f" % mc)

plt.legend(loc="upper right")
plt.title(r"Posterior distributions of $\lambda$s")
plt.xlim([1.5, 4.0])
plt.xlabel("$\lambda_1$ value")
Out[5]:
<matplotlib.text.Text at 0x1bf602e8>
In [6]:
#mdl.stats()

The sample mean for Buins and Canucks are 2.8, 2.5 respectively, so it seems Buins has advantage. (Note that the numbers are slightly different from the book: 2.6 for the Canucks and 2.9 for the Bruins.)

To compute the probability that each team wins the next game, one can use the simulated data from fitted model.

In [7]:
buins_predicts = mdl.trace('predict_buins')[:]
canucks_predicts = mdl.trace('predict_canucks')[:]
In [8]:
figsize(12.5, 4)
plt.hist(buins_predicts, histtype='stepfilled', bins=50, alpha=0.65,
         label="buins_predicts", color="#A60628", normed=True)
plt.hist(canucks_predicts, histtype='stepfilled', bins=50, alpha=0.65,
         label="canucks_predicts", color="#467821", normed=True)
plt.legend(loc="upper right")
plt.xlim([0, 12])
plt.title(r"Sample scores of next game")
Out[8]:
<matplotlib.text.Text at 0x1c15f7f0>
In [9]:
game = buins_predicts - canucks_predicts
num = len(game)
prob = len(game[game > 0])*100./num, len(game[game < 0])*100.0/num, len(game[game == 0])*100./num
print 'Team Buins vs Canucks, prob. of Winning: %.1f%%, Losing: %.1f%%, Tie: %.1f%%' % prob
Team Buins vs Canucks, prob. of Winning: 46.1%, Losing: 36.7%, Tie: 17.2%

Sudden death

In the case of Tie, the winner is determined by sudden death in overtime. The important statistic is not goals per game, but time until the first goal, which is in exponential distribution.

In [10]:
need_overtime = (game == 0)
buins_times = mdl.trace('time_buins')[:]
canucks_times = mdl.trace('time_canucks')[:]
overtime = (buins_times - canucks_times)[need_overtime]
prob2 = len(overtime[overtime < 0])*100./len(overtime)
print 'Team Buins wins in overtime is: %.1f%%' % prob2
print 'Team Buins vs Canucks, prob. of Winning on next game: %.1f%%' % (prob[0]+prob[2]*prob2/100.)
Team Buins wins in overtime is: 52.1%
Team Buins vs Canucks, prob. of Winning on next game: 55.0%

The results are very close to the values given in the book.

Discussion

So it turns out that the results are sensitive to the prior, which makes sense considering how little data we have to work with. Based on the difference between the low-variance model and the high-variable model, it seems worthwhile to put some effort into getting the prior right.

Chapter 8 Observer Bias

The Red Line Problem

The Bayesian estimation approach to help predict the wait time and decide when one should give up and take a taxi. The assumptions are:

  • Treat the passenger arrivals as a Poisson process, which means that passengers are equally likely to arrive at any time, and that they arrive at an unknown rate, λ, measured in passengers per minute. λ is assumed to be constant.
  • The arrival process for trains are not Poisson. The data is manually gathered.

    Note that the average time between trains, as seen by a random passenger ('zb'), is substantially higher than the true average ('z'). The exaplanation is in the book, because values from the actual distribution are oversampled in the proportion to their value.

In [1]:
%matplotlib inline

import pymc as pm
import numpy as np
from IPython.core.pylabtools import figsize
from matplotlib import pyplot as plt
In [2]:
# observed gaps between trains, in seconds
# collected using code in redline_data.py, run daily 4-6pm
# for 5 days, Monday 6 May 2013 to Friday 10 May 2013

OBSERVED_GAP_TIMES = [
    428.0, 705.0, 407.0, 465.0, 433.0, 425.0, 204.0, 506.0, 143.0, 351.0, 
    450.0, 598.0, 464.0, 749.0, 341.0, 586.0, 754.0, 256.0, 378.0, 435.0, 
    176.0, 405.0, 360.0, 519.0, 648.0, 374.0, 483.0, 537.0, 578.0, 534.0, 
    577.0, 619.0, 538.0, 331.0, 186.0, 629.0, 193.0, 360.0, 660.0, 484.0, 
    512.0, 315.0, 457.0, 404.0, 740.0, 388.0, 357.0, 485.0, 567.0, 160.0, 
    428.0, 387.0, 901.0, 187.0, 622.0, 616.0, 585.0, 474.0, 442.0, 499.0, 
    437.0, 620.0, 351.0, 286.0, 373.0, 232.0, 393.0, 745.0, 636.0, 758.0,
]
#plt.hist(OBSERVED_GAP_TIMES, bins=50)
data = np.array([x/60. for x in OBSERVED_GAP_TIMES])

from statsmodels.distributions import ECDF as empericalCDF
eCDF = empericalCDF(data)

figsize(12.5, 4)
plt.step(eCDF.x, eCDF.y)
plt.xlim([0, 20])
plt.title('Emperical cdf of RedLine observations')
Out[2]:
<matplotlib.text.Text at 0x19b6d518>

Next step is density (pdf) estimation. The code below is copied from the post Kernel Density Estimation in Python.

In [3]:
from sklearn.neighbors import KernelDensity
from scipy.stats import gaussian_kde
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric.kernel_density import KDEMultivariate

def kde_scipy(x, x_grid, bandwidth=0.2, **kwargs):
    """Kernel Density Estimation with Scipy"""
    # Note that scipy weights its bandwidth by the covariance of the
    # input data.  To make the results comparable to the other methods,
    # we divide the bandwidth by the sample standard deviation here.
    kde = gaussian_kde(x, bw_method=bandwidth / x.std(ddof=1), **kwargs)
    return kde.evaluate(x_grid)

def kde_statsmodels_u(x, x_grid, bandwidth=0.2, **kwargs):
    """Univariate Kernel Density Estimation with Statsmodels"""
    kde = KDEUnivariate(x)
    kde.fit(bw=bandwidth, **kwargs)
    return kde.evaluate(x_grid)
      
def kde_statsmodels_m(x, x_grid, bandwidth=0.2, **kwargs):
    """Multivariate Kernel Density Estimation with Statsmodels"""
    kde = KDEMultivariate(x, bw=bandwidth * np.ones_like(x),
                          var_type='c', **kwargs)
    return kde.pdf(x_grid)

def kde_sklearn(x, x_grid, bandwidth=0.2, **kwargs):
    """Kernel Density Estimation with Scikit-learn"""
    kde_skl = KernelDensity(bandwidth=bandwidth, **kwargs)
    kde_skl.fit(x[:, np.newaxis])
    # score_samples() returns the log-likelihood of the samples
    log_pdf = kde_skl.score_samples(x_grid[:, np.newaxis])
    return np.exp(log_pdf)

kde_funcs = [kde_statsmodels_u, kde_statsmodels_m, kde_scipy, kde_sklearn]
kde_funcnames = ['Statsmodels-U', 'Statsmodels-M', 'Scipy', 'Scikit-learn']

print "Package Versions:"
import sklearn; print "  scikit-learn:", sklearn.__version__
import scipy; print "  scipy:", scipy.__version__
import statsmodels; print "  statsmodels:", statsmodels.__version__
Package Versions:
  scikit-learn: 0.15.2
  scipy: 0.14.0
  statsmodels: 0.5.0

In [4]:
fig, ax = plt.subplots(1, 4, sharey=True, figsize=(12.5, 4))
fig.subplots_adjust(wspace=0)

xs = np.linspace(0, 20, 101)
for i in range(4):
    ax[i].hist(data, normed=True, label='data', bins=30)
    pdf = kde_funcs[i](data, xs, bandwidth=1)
    ax[i].plot(xs, pdf, lw=2, color='r')
    ax[i].set_xlim([0, 20])
    ax[i].set_title('pdf: ' + kde_funcnames[i])

All methods arrive at the same pdf, with bandwidth limited to 1. It is substantially different from the one in the book, even though the shape is similar.

In [5]:
N = 60. # use 1-second precision
xs = np.linspace(0, 20, 20*N+1)
pmf_z = kde_funcs[0](data, xs, bandwidth=1)

pmf_zb = xs*pmf_z
pmf_zb = pmf_zb/pmf_zb.sum()*N

pmf_y = np.zeros(20*N+1)
for i in range(int(20*N)):
    p = np.array([pmf_zb[i]/(i+1)]*(i+1) + [0]*int(20*N-i))
    pmf_y += p

cdf_z = np.cumsum(pmf_z)/N
cdf_zb = np.cumsum(pmf_zb)/N
cdf_y = np.cumsum(pmf_y)/N
In [6]:
fig, ax = plt.subplots(1, 2, figsize=(12.5, 4))

ax[0].plot(xs, pmf_z, label='z')
ax[0].plot(xs, pmf_zb, label='zb')
ax[0].plot(xs, pmf_y, label='y')
ax[0].legend()
ax[0].set_title('pmf of data')

ax[1].plot(xs,cdf_z, label='z')
ax[1].plot(xs, cdf_zb, label='zb')
ax[1].plot(xs, cdf_y, label='y')
ax[1].set_ylim([0, 1])
ax[1].legend()
ax[1].set_title('cdf of data')
Out[6]:
<matplotlib.text.Text at 0x1b3e72e8>
In [7]:
mean = lambda y: np.sum(xs*y)/N

print 'The mean of z = %.2f, of zb = %.2f, of c = %.2f' % (mean(pmf_z), mean(pmf_zb), mean(pmf_y))
The mean of z = 7.77, of zb = 8.85, of c = 4.42

Predicting wait times

Assume that the actual distribution of z is given, and the passenger arrival rate, λ, is 2 passengers per minute.

In [8]:
cdf = lambda p: np.cumsum(p)/N

prior_x = pmf_y

#posterior calculation
lam = 2.
k = 15
l_poisson = lambda x: np.exp(-lam*x)*((lam*x)**k)/np.prod(np.linspace(1,k,k))
yy = np.array([l_poisson(x) for x in xs])*prior_x
posterior_x = yy/np.sum(yy)*N

# y = zb - x
pred_y = np.zeros(len(xs))
for i in range(len(xs)):
    for j in range(i+1):
        pred_y[i-j] += pmf_zb[i]*posterior_x[j]
pred_y = pred_y/np.sum(pred_y)*N
In [9]:
figsize(6.25, 4)
plt.plot(xs, cdf(prior_x), label='prior_x')
plt.plot(xs, cdf(posterior_x), label='posterior_x')
plt.plot(xs, cdf(pred_y), label='pred_y')
plt.ylim([0, 1])
plt.legend(loc='upper right')
Out[9]:
<matplotlib.legend.Legend at 0x1aed1630>

Estimating the arrival rate

In [21]:
obs_k1 = np.array([17, 22, 23, 18, 4])
obs_y  = np.array([4.6, 1.0, 1.4, 5.4, 5.8])
obs_k2 = np.array([9, 0, 4, 12, 11])

# prior = Uniform[0, 5]
ts = xs[:5*N+1]
prior_lam = np.array([1./5.]*len(ts))

posterior_lam = prior_lam
l_poisson = lambda x, lam, k: np.exp(-lam*x)*((lam*x)**k)/np.prod(np.linspace(1,k,k))
for i in range(len(obs_y)):
    posterior_lam = np.array([l_poisson(obs_y[i], lam, obs_k2[i]) for lam in ts])
posterior_lam = posterior_lam/np.sum(posterior_lam)*N

figsize(6.25, 4)
plt.step(ts, cdf(prior_lam), label='prior $\lambda$')
plt.step(ts, cdf(posterior_lam), label='posterior $\lambda$')
#plt.xlim([0, 5])
plt.ylim([0, 1])
plt.legend(loc='upper left')
plt.title('estimate of cdf ($\lambda$)')
Out[21]:
<matplotlib.text.Text at 0x1c8fe5f8>

Now incorporate this uncertainty to estimate distribution of pred_y.

In [74]:
from numba import autojit
import copy

@autojit
def nb_calc(posterior_x):
    # y = zb - x
    pred_y = np.zeros(len(xs))
    for i in range(len(xs)):
        for j in range(i+1):
            pred_y[i-j] += pmf_zb[i]*posterior_x[j]
    pred_y = pred_y/np.sum(pred_y)*N
    return pred_y
    
def calc_pred_y(lam, k):
    ''' Compute pred_y with given data of (lam, k)
        Assumptions used: prior_x, pmf_zb, posterior_x
    '''        
    l_poisson = lambda x: np.exp(-lam*x)*((lam*x)**k)/np.prod(np.linspace(1,k,k))
    yy = np.array([l_poisson(x) for x in xs])*prior_x
    posterior_x = yy/np.sum(yy)*N
    pred_y = nb_calc(posterior_x)
    return pred_y

agg_pred_y = np.ones(len(xs))
# figsize(6.25, 4)
pltyy = []
for i in range(len(ts)):
    lam = ts[i]
    p = posterior_lam[i]
    k = 15 # num of passengers observed is fixed
    yy = calc_pred_y(lam, k)
    agg_pred_y = agg_pred_y*yy*p
    if i % 30 == 0:
        print 'calc pred_y for lam = ', i/60.
        pltyy.append(copy.deepcopy(yy))
    
agg_pred_y = calc_pred_y(2, 15)
for yy in pltyy:
        plt.plot(xs, cdf(yy), ls='-.', color='blue')
plt.plot(xs, cdf(agg_pred_y), lw=2, color='r', label='mix')
plt.xlim([0, 20])
plt.ylim([0, 1])
plt.legend(loc='upper left')
plt.xlabel('wait time (min)')
plt.ylabel('CDF')
calc pred_y for lam =  0.0
calc pred_y for lam =  0.5
calc pred_y for lam =  1.0
calc pred_y for lam =  1.5
calc pred_y for lam =  2.0
calc pred_y for lam =  2.5
calc pred_y for lam =  3.0
calc pred_y for lam =  3.5
calc pred_y for lam =  4.0
calc pred_y for lam =  4.5
calc pred_y for lam =  5.0

Out[74]:
<matplotlib.text.Text at 0x25c8b1d0>
In [75]:
cdf_agg_pred_y = cdf(agg_pred_y)
getProb = lambda x: cdf_agg_pred_y[list(xs).index(x)]*100.
print 'Prob. of waiting more than 15 mins after arrival: P(y >= 15) = %f%%.' % (100 - getProb(15.))
Prob. of waiting more than 15 mins after arrival: P(y >= 15) = 0.000035%.

Zero probability given the observations!! With adjusted sampling from the book, it is increased to about 0.02.

Bayesian

Comments