Beta-binomial Model
Here I use the simplest beta-binomial model. Suppose we observe $n_i$ reads covering a mismatch position $i$, of which $m_i$ reads harbor an allele that is considered mutated. The mutation rate $\theta_i$ follows a beta distribution:
$$
\begin{aligned}
m_i &\sim \mathrm{Binom}(n_i, \theta_i) \\
\theta_i &\sim \mathrm{Beta}(\alpha_i, \beta_i)
\end{aligned}
$$
where $\alpha_i$ and $\beta_i$ are the hyperparameters of the beta prior, and $B(\cdot, \cdot)$ denotes the beta function.
We know that
$$
\theta \sim \mathrm{Beta}(\alpha, \beta), \qquad \mathrm{Beta}(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}
$$
$$
\begin{aligned}
B(\alpha, \beta) &= \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt &&\text{(integral definition)} \\
&= \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)} &&\text{(relationship to the gamma function)}
\end{aligned}
$$
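As a quick numerical sanity check (a minimal sketch using scipy; the values 2.5 and 4.0 are arbitrary), the integral definition and the gamma-function form agree:

# check B(α, β) = ∫ t^{α-1}(1-t)^{β-1} dt = Γ(α)Γ(β)/Γ(α+β) numerically
from scipy.integrate import quad
from scipy.special import beta as B, gamma as Γ

a, b = 2.5, 4.0  # arbitrary example values
integral, _ = quad(lambda t: t**(a - 1) * (1 - t)**(b - 1), 0, 1)
print(integral, B(a, b), Γ(a) * Γ(b) / Γ(a + b))  # all three agree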
We can marginalize out the latent variable $\theta_i$, and the probability of observing $m_i$ given $n_i$, $\alpha_i$, $\beta_i$ becomes:
$$
\begin{aligned}
P(m_i \mid n_i, \alpha_i, \beta_i) &= \int_0^1 \mathrm{Binom}(m_i \mid n_i, \theta_i)\cdot \mathrm{Beta}(\theta_i \mid \alpha_i, \beta_i)\,d\theta_i \\
&= \int_0^1 \binom{n_i}{m_i}\theta_i^{m_i}(1-\theta_i)^{n_i-m_i}\cdot \frac{1}{B(\alpha_i, \beta_i)}\,\theta_i^{\alpha_i-1}(1-\theta_i)^{\beta_i-1}\,d\theta_i \\
&= \binom{n_i}{m_i}\frac{1}{B(\alpha_i, \beta_i)}\int_0^1 \theta_i^{m_i+\alpha_i-1}(1-\theta_i)^{n_i-m_i+\beta_i-1}\,d\theta_i \\
&= \binom{n_i}{m_i}\frac{B(m_i+\alpha_i,\ n_i-m_i+\beta_i)}{B(\alpha_i, \beta_i)}
\end{aligned}
$$
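This closed form is exactly the beta-binomial pmf, so we can sanity-check it against scipy.stats.betabinom (a small sketch; the counts and hyperparameters are arbitrary):

# compare the explicit beta-function expression with scipy's betabinom pmf
from scipy import stats
from scipy.special import beta as B, comb

n, m, α, β = 30, 7, 2.0, 5.0  # arbitrary example values
closed_form = comb(n, m) * B(m + α, n - m + β) / B(α, β)
print(closed_form, stats.betabinom.pmf(m, n, α, β))  # identical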
Reparameterization
We can also reparameterize $\alpha$ and $\beta$ to $\mu$ and $\rho$:
$$
\mu = \frac{\alpha}{\alpha+\beta}, \qquad \rho = \frac{1}{\alpha+\beta+1}; \qquad \alpha = \frac{\mu(1-\rho)}{\rho}, \qquad \beta = \frac{(1-\mu)(1-\rho)}{\rho}
$$
where $\mu$ corresponds to the mean of $\theta$, and $\rho$ captures the extent of overdispersion.
The expectation and variance of the beta-binomial distribution are:
$$
E(m \mid n, \mu, \rho) = n\mu, \qquad \mathrm{Var}(m \mid n, \mu, \rho) = n\mu(1-\mu)\left[1 + (n-1)\rho\right]
$$
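A small sketch checking both the reparameterization and these moment formulas against scipy.stats.betabinom.stats (arbitrary α, β, n):

# verify E(m) = nμ and Var(m) = nμ(1-μ)[1+(n-1)ρ] under the reparameterization
from scipy import stats

α, β, n = 2.0, 5.0, 50  # arbitrary example values
μ, ρ = α / (α + β), 1 / (α + β + 1)
mean, var = stats.betabinom.stats(n, α, β, moments='mv')
print(mean, n * μ)                               # both ≈ 14.29
print(var, n * μ * (1 - μ) * (1 + (n - 1) * ρ))  # both ≈ 72.70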
Maximum Likelihood Estimation (Frequentist)
We can estimate $\alpha$ and $\beta$ (shared across all $I$ sites) by maximum likelihood estimation (MLE). The likelihood function is:
$$
\ell(\alpha, \beta) = \prod_{i=1}^{I} P(m_i \mid n_i, \alpha, \beta) = \prod_{i=1}^{I} \binom{n_i}{m_i}\frac{B(m_i+\alpha,\ n_i-m_i+\beta)}{B(\alpha, \beta)}
$$
After reparameterization:
$$
\ell(\mu, \rho) = \prod_{i=1}^{I} P(m_i \mid n_i, \mu, \rho) = \prod_{i=1}^{I} \binom{n_i}{m_i}\frac{B\!\left(m_i+\frac{\mu(1-\rho)}{\rho},\ n_i-m_i+\frac{(1-\mu)(1-\rho)}{\rho}\right)}{B\!\left(\frac{\mu(1-\rho)}{\rho},\ \frac{(1-\mu)(1-\rho)}{\rho}\right)}
$$
For a single site, we can get the point estimate of $\theta_i$ by taking the derivative of the binomial likelihood and setting it to 0:
$$
\begin{aligned}
\nabla_{\theta_i} P(m_i \mid \theta_i) &= \frac{d}{d\theta_i}\left[\binom{n_i}{m_i}\theta_i^{m_i}(1-\theta_i)^{n_i-m_i}\right] \\
&\propto m_i\,\theta_i^{m_i-1}(1-\theta_i)^{n_i-m_i} + (m_i-n_i)\,\theta_i^{m_i}(1-\theta_i)^{n_i-m_i-1} = 0 \\
\hat{\theta}_i^{\mathrm{MLE}} &= \frac{m_i}{n_i}
\end{aligned}
$$
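A quick numerical confirmation of $\hat{\theta}_i^{\mathrm{MLE}} = m_i/n_i$ by maximizing the binomial likelihood directly (arbitrary m, n):

# maximize the binomial likelihood numerically; the argmax matches m/n
from scipy import stats
from scipy.optimize import minimize_scalar

n, m = 40, 9  # arbitrary example values
res = minimize_scalar(lambda θ: -stats.binom.logpmf(m, n, θ),
                      bounds=(1e-6, 1 - 1e-6), method='bounded')
print(res.x, m / n)  # both ≈ 0.225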
Bayesian Estimation
First, derive the posterior distribution of $\theta_i$:
$$
\begin{aligned}
P(\theta_i \mid m_i) &= \frac{P(m_i \mid \theta_i)\,P(\theta_i)}{P(m_i)} = \frac{P(m_i \mid \theta_i)\,P(\theta_i)}{\int_0^1 P(m_i \mid \theta_i)\,P(\theta_i)\,d\theta_i} \\
&= \frac{\binom{n_i}{m_i}\theta_i^{m_i+\alpha_i-1}(1-\theta_i)^{n_i-m_i+\beta_i-1}\big/B(\alpha_i, \beta_i)}{\int_0^1 \binom{n_i}{m_i}\theta_i^{m_i+\alpha_i-1}(1-\theta_i)^{n_i-m_i+\beta_i-1}\,d\theta_i\big/B(\alpha_i, \beta_i)} \\
&= \frac{\theta_i^{m_i+\alpha_i-1}(1-\theta_i)^{n_i-m_i+\beta_i-1}}{B(m_i+\alpha_i,\ n_i-m_i+\beta_i)}
\end{aligned}
$$
where $P(m_i)$ is the marginal distribution of $m_i$, i.e., the joint density $P(\theta_i, m_i)$ with $\theta_i$ integrated out. Dropping it gives the proportional form $P(\theta_i \mid m_i) \propto P(\theta_i)\,P(m_i \mid \theta_i)$.
To calculate the maximum a posteriori (MAP) point estimate, take the derivative of the posterior density and set it to 0:
$$
\begin{aligned}
\nabla_{\theta_i} P(\theta_i \mid m_i) &= \frac{d}{d\theta_i}\left[\frac{\theta_i^{m_i+\alpha_i-1}(1-\theta_i)^{n_i-m_i+\beta_i-1}}{B(m_i+\alpha_i,\ n_i-m_i+\beta_i)}\right] \\
&\propto \theta_i^{m_i+\alpha_i-1}(1-\theta_i)^{n_i-m_i+\beta_i-1}\left[(m_i+\alpha_i-1)\,\theta_i^{-1} - (n_i-m_i+\beta_i-1)(1-\theta_i)^{-1}\right] = 0 \\
\hat{\theta}_i^{\mathrm{MAP}} &= \frac{m_i+\alpha_i-1}{n_i+\alpha_i+\beta_i-2}
\end{aligned}
$$
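Again a small numerical check of the MAP formula, this time against a grid argmax of the posterior density (arbitrary values):

# the argmax of Beta(m+α, n-m+β) matches (m+α-1)/(n+α+β-2)
import numpy as np
from scipy import stats

n, m, α, β = 40, 9, 2.0, 5.0  # arbitrary example values
grid = np.linspace(1e-4, 1 - 1e-4, 100001)
numeric_map = grid[np.argmax(stats.beta.pdf(grid, m + α, n - m + β))]
print(numeric_map, (m + α - 1) / (n + α + β - 2))  # both ≈ 0.2222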
For Bayesian inference, we can read the posterior distribution off directly:
$$
\theta_i \mid m_i \sim \mathrm{Beta}(m_i+\alpha_i,\ n_i-m_i+\beta_i)
$$
Note that the expected value of a beta random variable is:
$$
\begin{aligned}
E[\theta] &= \int_0^1 \theta\, f_\theta(\theta)\,d\theta = \int_0^1 \theta\,\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}\,d\theta \\
&= \frac{1}{B(\alpha, \beta)}\int_0^1 \theta^{\alpha}(1-\theta)^{\beta-1}\,d\theta \\
&= \frac{B(\alpha+1, \beta)}{B(\alpha, \beta)} &&\text{(rewrite the beta functions in terms of gamma functions)} \\
&= \frac{\Gamma(\alpha+1)\,\Gamma(\beta)}{\Gamma(\alpha+\beta+1)}\cdot\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)} &&\text{(using } \Gamma(x+1)=x\,\Gamma(x)\text{)} \\
&= \frac{\alpha}{\alpha+\beta}
\end{aligned}
$$
The variance is:
$$
\mathrm{Var}(\theta) = E[\theta^2] - E[\theta]^2 = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} - \frac{\alpha^2}{(\alpha+\beta)^2} = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
$$
Then the posterior expected value and variance follow immediately, because the posterior simply adds $m_i$ to $\alpha_i$ and $n_i - m_i$ to $\beta_i$:
$$
E(\hat{\theta}_i) = \frac{m_i+\alpha_i}{n_i+\alpha_i+\beta_i}, \qquad \mathrm{Var}(\hat{\theta}_i) = \frac{(m_i+\alpha_i)(n_i-m_i+\beta_i)}{(n_i+\alpha_i+\beta_i)^2\,(n_i+\alpha_i+\beta_i+1)}
$$
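These closed forms can be checked against scipy.stats.beta.stats evaluated at the updated parameters (a sketch with arbitrary values):

# posterior mean and variance from scipy vs. the closed forms above
from scipy import stats

n, m, α, β = 40, 9, 2.0, 5.0  # arbitrary example values
mean, var = stats.beta.stats(m + α, n - m + β, moments='mv')
print(mean, (m + α) / (n + α + β))
print(var, (m + α) * (n - m + β) / ((n + α + β)**2 * (n + α + β + 1)))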
Empirical Bayesian Inference
Estimating a Prior from All Your Data
Now we can empirically estimate $\alpha_0$ and $\beta_0$ for the prior beta distribution from all data points, using MLE with the beta-binomial likelihood function:
$$
L(\alpha, \beta) = -\log \ell(\alpha, \beta) = -\sum_{i=1}^{I}\log\!\left(\frac{\Gamma(m_i+\alpha)\,\Gamma(n_i-m_i+\beta)}{\Gamma(n_i+\alpha+\beta)}\cdot\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\right) \qquad \text{(constants w.r.t. } \alpha, \beta \text{ dropped)}
$$
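In code, this objective is convenient to write with scipy.special.betaln; here is a minimal sketch with made-up read counts (the full pipeline below does the same through scipy.stats.betabinom):

# fit the prior (α_0, β_0) by minimizing the NLL; betaln(m+α, n-m+β) - betaln(α, β)
# is the log beta-function ratio, with the binomial coefficient dropped as a constant
import numpy as np
from scipy.special import betaln
from scipy.optimize import minimize

m = np.array([3, 0, 5, 2, 1, 7, 4])         # mutated reads (toy data)
n = np.array([30, 25, 60, 20, 15, 80, 50])  # total reads (toy data)

def nll(params):
    α, β = params
    return -np.sum(betaln(m + α, n - m + β) - betaln(α, β))

res = minimize(nll, x0=[1.0, 1.0], method='L-BFGS-B',
               bounds=[(1e-3, None), (1e-3, None)])
α_0, β_0 = res.x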
Through the $(\mu, \rho)$ reparameterization, we can adjust for the total number of reads at each mismatch site; likewise, we can correct for other confounding factors such as batch, sex, and age. To achieve this, we regress $\mu$ on the covariates:
$$
\begin{aligned}
\theta_i &\sim \mathrm{Beta}(\mu_i, \rho) \\
\mu_i &= \sigma^{-1}(X_i^{T}\omega + \epsilon)
\end{aligned}
$$
where $\sigma$ is the logit link function (so $\sigma^{-1}$ is the logistic function), $X_i$ is the $i$-th row of a design matrix $X$ of confounding factors, $\omega$ is the vector of coefficients for the covariates in $X$, and $\epsilon$ is the intercept. Potentially, the design matrix $X$ could be:
$$
X \sim \log(n) + \mathrm{batch} + \mathrm{group} + \mathrm{age} + \dots
$$
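One plausible way to assemble such a design matrix with pandas is sketched below; the covariate columns (batch, group, age) are hypothetical names, not part of the model above:

# build X = [log(n), batch dummies, group dummies, age]; column names are hypothetical
import numpy as np
import pandas as pd

df = pd.DataFrame({'n': [30, 25, 60], 'batch': ['b1', 'b2', 'b1'],
                   'group': ['ctrl', 'case', 'case'], 'age': [34, 51, 42]})
X = pd.concat([np.log(df['n']).rename('log_n'),
               pd.get_dummies(df[['batch', 'group']], drop_first=True),
               df['age']], axis=1).to_numpy(dtype=float)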
Likewise, we can get the negative log-likelihood with respect to $\mu$ and $\rho$:
$$
L(\mu, \rho) = -\log \ell(\mu, \rho) = -\sum_{i=1}^{I}\log P(m_i \mid \mu_i, \rho)
$$
Updating Each Data Point
Just like the derivation shown above, we can update each $\theta_i$ by calculating the expectation of its posterior distribution, the expected a posteriori (EAP) estimate, since the solution is closed-form:
$$
\begin{aligned}
\alpha_i &= m_i + \alpha_0 \\
\beta_i &= n_i - m_i + \beta_0 \\
\hat{\theta}_i^{\mathrm{EAP}} &= \frac{m_i+\alpha_0}{n_i+\alpha_0+\beta_0}
\end{aligned}
$$
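In code the update is a one-liner per site; a tiny sketch with assumed fitted prior values:

# closed-form EAP update per site (α_0, β_0 assumed to come from the fit above)
import numpy as np

m = np.array([3, 0, 5])     # toy counts
n = np.array([30, 25, 60])
α_0, β_0 = 1.2, 10.5        # assumed fitted prior values
θ_eap = (m + α_0) / (n + α_0 + β_0)  # shrinks m/n toward the prior mean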
Example Code
#!/usr/bin/env python
import numpy as np
from scipy import stats
from scipy.optimize import minimize
from scipy.special import expit, logit
import pandas as pd
import matplotlib.pyplot as plt


class BetaBinomEmpiricalBayes:

    def __init__(self, optim_method='L-BFGS-B'):
        self.optim_method = optim_method

    def reparam_1(self, α, β):
        '''convert α, β to μ, ρ'''
        μ = α / (α + β)
        ρ = 1 / (α + β + 1)
        return μ, ρ

    def reparam_2(self, μ, ρ):
        '''convert μ, ρ back to α, β'''
        α = μ * (1 - ρ) / ρ
        β = (1 - μ) * (1 - ρ) / ρ
        return α, β

    def neglog_likelihood(self, params, *args):
        '''
        Negative log-likelihood for the beta-binomial pmf,
        regression parameterization: μ = σ^{-1}(Xω + ε)
        - params: parameters to be estimated (ρ, ε, ω)
        - args: arrays of data points (m, n, x)
        '''
        ρ, ε, ω = params
        m, n, x = args
        μ = expit(np.dot(x, ω) + ε)
        α, β = self.reparam_2(μ, ρ)
        logpdf = stats.betabinom.logpmf(k=m, n=n, a=α, b=β, loc=0)
        return -np.sum(logpdf)

    def neglog_likelihood_ab(self, params, *args):
        '''
        Negative log-likelihood for the beta-binomial pmf
        - params: parameters to be estimated (α, β)
        - args: arrays of data points (m, n)
        '''
        a, b = params
        m, n = args
        logpdf = stats.betabinom.logpmf(k=m, n=n, a=a, b=b, loc=0)
        return -np.sum(logpdf)

    def fit_betabinom(self, params_init, data, bounds, ab_mode=False):
        '''Maximum likelihood estimation; ab_mode selects the objective'''
        objective = (self.neglog_likelihood_ab if ab_mode
                     else self.neglog_likelihood)
        res = minimize(objective,
                       x0=params_init,
                       args=data,
                       method=self.optim_method,
                       bounds=bounds)
        return res

    def post_reparam(self, ρ, ω, x, ε):
        '''convert the fitted μ, ρ (post MLE) back to α, β'''
        μ = expit(np.dot(x, ω) + ε)
        α, β = self.reparam_2(μ, ρ)
        return α, β

    def update_alpha(self, m, α):
        '''posterior α_i = m_i + α_0'''
        return m + α

    def update_beta(self, m, n, β):
        '''posterior β_i = n_i - m_i + β_0'''
        return n - m + β

    def update_theta(self, α_1, β_1):
        '''
        use the estimated parameters to update each data point:
        posterior θ (expectation and variance)
        α_1, β_1 are arrays
        '''
        θ_expected = α_1 / (α_1 + β_1)
        θ_var = (α_1 * β_1) / ((α_1 + β_1 + 1) * (α_1 + β_1)**2)
        return θ_expected, θ_var

    def credible_interval(self, alpha, beta):
        '''95% credible interval of the posterior beta distribution'''
        low = stats.beta.ppf(0.025, alpha, beta, loc=0, scale=1)
        high = stats.beta.ppf(0.975, alpha, beta, loc=0, scale=1)
        return low, high

    def posterior_err_prob(self, alpha, beta, cutoff):
        '''posterior error probability: beta cdf evaluated at the cutoff'''
        try:
            pep = stats.beta.cdf(cutoff, alpha, beta, loc=0, scale=1)
        except (TypeError, ValueError):
            pep = float('inf')
        return pep

    def q_value(self, pep):
        '''cumulative mean of all sorted posterior error probabilities'''
        s = pd.Series(pep).sort_values()
        return s.cumsum() / np.arange(1, len(s) + 1)

    def plot_beta_dist(self, a, b, save_to=None):
        '''
        adapted from
        https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.beta.html
        '''
        mean, var, skew, kurt = stats.beta.stats(a, b, moments='mvsk')
        print(f'mean={mean}, var={var}, skew={skew}, kurt={kurt}')
        fig, ax = plt.subplots(1, 1)
        x = np.linspace(stats.beta.ppf(0.01, a, b),
                        stats.beta.ppf(0.99, a, b), 100)
        ax.plot(x, stats.beta.pdf(x, a, b),
                'r-', lw=3.5, alpha=0.6, label='beta pdf')
        r = stats.beta.rvs(a, b, size=1000)
        ax.hist(r, density=True, bins=30, histtype='stepfilled', alpha=0.2)
        ax.legend(loc='best', frameon=False)
        plt.title(f'a={a:.4f}, b={b:.4f}')
        if save_to:
            plt.savefig(save_to)
        else:
            plt.show()


def run_bb_eBayes(table, ab_mode=False):
    '''table is a pandas DataFrame with integer columns m and n'''
    m = table['m'].to_numpy()
    n = table['n'].to_numpy()
    log_n = np.log(n)
    reg = BetaBinomEmpiricalBayes()
    if not ab_mode:
        # regression mode: estimate ρ, ε, ω with μ_i = expit(ω·log(n_i) + ε)
        params_init = np.array([0.5, 0.5, 0.8])
        res = reg.fit_betabinom(params_init=params_init,
                                data=(m, n, log_n),
                                bounds=[(0.001, 0.999),  # ρ must stay in (0, 1)
                                        (None, None),
                                        (None, None)])
        ρ, ε, ω = res.x
        params = f'ρ={ρ:.4f},ε={ε:.4f},ω={ω:.4f}'
        α_0, β_0 = reg.post_reparam(ρ=ρ, ω=ω, x=log_n, ε=ε)
    else:
        # plain mode: estimate a shared prior α, β directly
        params_init = np.array([1.0, 1.0])
        res = reg.fit_betabinom(params_init=params_init,
                                data=(m, n),
                                bounds=[(0.001, None),
                                        (0.001, None)],
                                ab_mode=True)
        α_0, β_0 = res.x
        params = f'a={α_0:.4f},b={β_0:.4f}'
    # update each data point with the fitted prior
    α_1 = reg.update_alpha(m, α_0)
    β_1 = reg.update_beta(m, n, β_0)
    θ_expected, θ_var = reg.update_theta(α_1, β_1)
    table['α_1'] = α_1
    table['β_1'] = β_1
    table['freq'] = m / n
    table['eb_freq'] = θ_expected
    table['eb_var'] = θ_var
    # sort on the numeric values before formatting for display
    table = table.sort_values(by=['eb_freq', 'eb_var'],
                              ascending=[False, True])
    for col, fmt in [('α_1', '{:.3f}'), ('β_1', '{:.3f}'),
                     ('freq', '{:.4f}'), ('eb_freq', '{:.4f}'),
                     ('eb_var', '{:.3e}')]:
        table[col] = table[col].map(fmt.format)
    return table, params
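Finally, a minimal usage sketch, appended to the script above so its imports are in scope; the 500 simulated sites and the Beta(2, 20) prior are made up for illustration:

if __name__ == '__main__':
    # simulate 500 sites: θ_i ~ Beta(2, 20), m_i ~ Binom(n_i, θ_i)
    rng = np.random.default_rng(0)
    n = rng.integers(10, 200, size=500)
    θ = rng.beta(2, 20, size=500)
    m = rng.binomial(n, θ)
    table = pd.DataFrame({'m': m, 'n': n})
    result, params = run_bb_eBayes(table, ab_mode=True)
    print(params)          # fitted prior a, b
    print(result.head())   # per-site EAP estimates, highest eb_freq first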