Beta-binomial empirical Bayesian modeling

Jan 9, 2023

Beta-binomial Model

Here I am using the simplest beta-binomial model. Suppose we observed $n_i$ reads covering a mismatch position $i$, of which $m_i$ reads harbor an allele that is considered mutated. The mutation rate $\theta_i$ follows a beta distribution.

$$
\begin{split}
m_{i} &\sim \mathrm{Binom}(n_{i}, \theta_{i}) \\
\theta_{i} &\sim \mathrm{Beta}(\alpha_{i}, \beta_{i})
\end{split}
$$

where $\alpha_i$ and $\beta_i$ are hyperparameters, and $\mathrm{B}(\cdot)$ denotes the beta function.

We know that

$$
\theta \sim \mathrm{Beta}(\alpha, \beta) = \mathrm{Beta}(\theta|\alpha, \beta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\mathrm{B}(\alpha, \beta)}
$$

$$
\begin{split}
\mathrm{B}(\alpha, \beta) &= \int_0^1 t^{\alpha-1} (1-t)^{\beta-1}\, dt \quad \text{(integral definition)} \\
\mathrm{B}(\alpha, \beta) &= \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)} \quad \text{(relationship to the gamma function)}
\end{split}
$$

We can marginalize out the latent variable $\theta_i$, and the probability of having $m_i$ given $n_i$, $\alpha_i$, $\beta_i$ becomes:

$$
\begin{split}
P(m_i|n_i,\alpha_i,\beta_i) &= \int_{0}^{1} \mathrm{Binom}(m_i|n_i,\theta_i) \cdot \mathrm{Beta}(\theta_i|\alpha_i,\beta_i)\, d\theta_i \\
&= \int_0^1 \binom{n_i}{m_i}\theta_{i}^{m_i}(1-\theta_i)^{n_i-m_i} \cdot \frac{1}{\mathrm{B}(\alpha_i, \beta_i)}\, \theta_i^{\alpha_{i}-1}(1-\theta_i)^{\beta_{i}-1}\, d\theta_i \\
&= \binom{n_i}{m_i} \frac{1}{\mathrm{B}(\alpha_{i}, \beta_{i})} \int_0^1 \theta_{i}^{m_i+\alpha_{i}-1}(1-\theta_i)^{n_i-m_i+\beta_{i}-1}\, d\theta_i \\
&= \binom{n_i}{m_i} \frac{\mathrm{B}(m_i+\alpha_{i},\, n_i-m_i+\beta_{i})}{\mathrm{B}(\alpha_{i}, \beta_{i})}
\end{split}
$$
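As a quick sanity check, the closed-form marginal can be evaluated with the log-beta function and compared against scipy.stats.betabinom. This is a minimal sketch with made-up numbers ($n=50$, $m=7$, $\alpha=2$, $\beta=30$), not part of the model itself:

import numpy as np
from scipy import stats
from scipy.special import betaln, comb

# toy values for n, m, α, β (made up, just to check the algebra)
n, m, a, b = 50, 7, 2.0, 30.0
manual = comb(n, m) * np.exp(betaln(m + a, n - m + b) - betaln(a, b))
print(manual, stats.betabinom.pmf(m, n, a, b))  # the two numbers agree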

Reparameterization

We can also reparameterize $\alpha$ and $\beta$ to $\mu$ and $\rho$:

$$
\begin{split}
\mu &= \frac{\alpha}{\alpha+\beta}, \qquad \rho = \frac{1}{\alpha+\beta+1} \\
\alpha &= \frac{\mu(1-\rho)}{\rho}, \qquad \beta = \frac{(1-\mu)(1-\rho)}{\rho}
\end{split}
$$

where $\mu$ corresponds to the estimate of $\theta$, and $\rho$ corresponds to the extent of overdispersion.

The expectation and variance of the beta-binomial distribution are:

$$
\mathbb{E}(m|n,\mu,\rho) = n\mu; \qquad \mathrm{Var}(m|n,\mu,\rho) = n\mu(1-\mu)\left[1+(n-1)\rho\right]
$$
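The two moment formulas can be checked numerically; below is a small sketch with arbitrary values of $\mu$, $\rho$, and $n$ that compares them to the moments scipy reports for the corresponding beta-binomial:

from scipy import stats

mu, rho, n = 0.01, 0.05, 200          # arbitrary illustrative values
a = mu * (1 - rho) / rho              # map (μ, ρ) back to (α, β)
b = (1 - mu) * (1 - rho) / rho
mean, var = stats.betabinom.stats(n, a, b, moments='mv')
print(mean, n * mu)                                   # both equal nμ
print(var, n * mu * (1 - mu) * (1 + (n - 1) * rho))   # both equal nμ(1-μ)[1+(n-1)ρ]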

Maximum Likelihood Estimation (Frequentist)

We can estimate $\alpha_i$ and $\beta_i$ by Maximum Likelihood Estimation (MLE). The likelihood function is:

$$
\ell(\alpha_{i},\beta_{i}) = \prod_{i=1}^{I} P(m_{i}|\alpha_{i},\beta_{i}) = \prod_{i=1}^{I} \binom{n_{i}}{m_{i}} \frac{\mathrm{B}(m_{i}+\alpha_{i},\, n_{i}-m_{i}+\beta_{i})}{\mathrm{B}(\alpha_{i}, \beta_{i})}
$$

After reparameterization:

$$
\ell(\mu_{i},\rho_{i}) = \prod_{i=1}^{I} P(m_{i}|\mu_{i},\rho_{i}) = \prod_{i=1}^{I} \binom{n_{i}}{m_{i}} \frac{\mathrm{B}\!\left(m_{i}+\frac{\mu_{i}(1-\rho_{i})}{\rho_{i}},\; n_{i}-m_{i}+\frac{(1-\mu_{i})(1-\rho_{i})}{\rho_{i}}\right)}{\mathrm{B}\!\left(\frac{\mu_{i}(1-\rho_{i})}{\rho_{i}},\; \frac{(1-\mu_{i})(1-\rho_{i})}{\rho_{i}}\right)}
$$

We can get the point estimate by taking the derivative of the (binomial) likelihood with respect to $\theta_i$ and setting it to 0:

$$
\begin{split}
\nabla_{\theta_{i}} P(m_{i}|\theta_{i}) &= \frac{d}{d\theta_{i}}\left[\binom{n_{i}}{m_{i}}\, \theta_{i}^{m_{i}}(1-\theta_{i})^{n_{i}-m_{i}}\right] \\
&\propto m_{i}\theta_{i}^{m_{i}-1}(1-\theta_{i})^{n_{i}-m_{i}} + \theta_{i}^{m_{i}}(m_{i}-n_{i})(1-\theta_{i})^{n_{i}-m_{i}-1} = 0 \\
\hat{\theta}_{i}^{\mathrm{MLE}} &= \frac{m_{i}}{n_{i}}
\end{split}
$$
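As a small illustration (with toy counts, not real data), numerically maximizing the binomial log-likelihood lands on the same $m_i/n_i$ estimate:

from scipy import stats
from scipy.optimize import minimize_scalar

m, n = 7, 50   # toy counts
res = minimize_scalar(lambda t: -stats.binom.logpmf(m, n, t),
                      bounds=(1e-6, 1 - 1e-6), method='bounded')
print(res.x, m / n)   # both ≈ 0.14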

Bayesian Estimation

First, derive the posterior distribution for $\theta_i$:

$$
\begin{split}
P(\theta_i|m_i) &= \frac{P(m_i|\theta_i)\,P(\theta_i)}{P(m_i)} = \frac{P(m_i|\theta_i)\,P(\theta_i)}{\int_{0}^{1} P(m_i|\theta_i)\,P(\theta_i)\, d\theta_i} \\
&= \frac{\binom{n_i}{m_i}\theta_{i}^{m_{i}+\alpha_{i}-1} (1-\theta_i)^{n_{i}-m_{i}+\beta_{i}-1}\big/\mathrm{B}(\alpha_{i},\beta_{i})}{\int_0^1 \binom{n_{i}}{m_{i}}\theta_{i}^{m_{i}+\alpha_{i}-1}(1-\theta_{i})^{n_{i}-m_{i}+\beta_{i}-1}\, d\theta_{i}\big/\mathrm{B}(\alpha_{i},\beta_{i})} \\
&= \frac{\theta_{i}^{m_{i}+\alpha_{i}-1} (1-\theta_{i})^{n_{i}-m_{i}+\beta_{i}-1}}{\mathrm{B}(m_{i}+\alpha_{i},\, n_{i}-m_{i}+\beta_{i})}
\end{split}
$$

where $P(m_i)$ is the marginal distribution of $m_i$, i.e. the joint density $P(\theta_i, m_i)$ with $\theta_i$ integrated out. Since this normalizing constant does not depend on $\theta_i$ (and the binomial coefficients cancel), the posterior can also be written as $P(\theta_i|m_i) \propto P(\theta_i)\, P(m_i|\theta_i)$, which is a $\mathrm{Beta}(m_i+\alpha_i,\, n_i-m_i+\beta_i)$ density.

To calculate the Maximum A Posteriori (MAP) point estimator, take the derivative of the posterior density and set it to 0:

$$
\begin{split}
\nabla_{\theta_{i}} P(\theta_{i}|m_{i}) &= \frac{d}{d\theta_{i}}\left[\frac{\theta_{i}^{m_i+\alpha_i-1}(1-\theta_i)^{n_i-m_i+\beta_i-1}}{\mathrm{B}(m_i+\alpha_i,\, n_i-m_i+\beta_i)}\right] \\
&\propto \theta_{i}^{m_i+\alpha_i-1}(1-\theta_i)^{n_i-m_i+\beta_i-1} \left(\frac{m_i+\alpha_i-1}{\theta_{i}} - \frac{n_i-m_i+\beta_i-1}{1-\theta_i}\right) = 0 \\
\hat{\theta}_{i}^{\mathrm{MAP}} &= \frac{m_i + \alpha_i - 1}{n_i + \alpha_i + \beta_i - 2}
\end{split}
$$
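For a concrete (made-up) example, the closed-form MAP above matches the numerical mode of the posterior $\mathrm{Beta}(m_i+\alpha_i,\, n_i-m_i+\beta_i)$ density:

import numpy as np
from scipy import stats

m, n, a0, b0 = 7, 50, 2.0, 30.0          # toy counts and prior
a1, b1 = m + a0, n - m + b0              # posterior parameters
theta_map = (m + a0 - 1) / (n + a0 + b0 - 2)
grid = np.linspace(1e-4, 1 - 1e-4, 100001)
theta_mode = grid[np.argmax(stats.beta.pdf(grid, a1, b1))]
print(theta_map, theta_mode)             # both ≈ 0.1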

For Bayesian inference, we can get the following from the posterior distribution:

$$
\theta_{i} \mid m_i \sim \mathrm{Beta}(m_i + \alpha_i,\, n_i - m_i + \beta_i) = \mathrm{Beta}(\theta_{i}|m_i + \alpha_i,\, n_i - m_i + \beta_i)
$$

Note that the expected value of a beta random variable is:

$$
\begin{split}
\mathbb{E}[\theta] &= \int_{0}^{1} \theta\, f_{\theta}(\theta)\, d\theta = \int_0^1 \theta\, \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\mathrm{B}(\alpha,\beta)}\, d\theta \\
&= \frac{1}{\mathrm{B}(\alpha,\beta)} \int_0^1 \theta^{\alpha}(1-\theta)^{\beta-1}\, d\theta \\
&= \frac{\mathrm{B}(\alpha+1,\beta)}{\mathrm{B}(\alpha, \beta)} \quad \text{(rewrite the beta functions in terms of gamma functions)} \\
&= \frac{\Gamma(\alpha+1)\,\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\alpha+\beta+1)} \quad \text{(using $\Gamma(z+1) = z\,\Gamma(z)$)} \\
&= \frac{\alpha}{\alpha+\beta}
\end{split}
$$

The variance is:

$$
\begin{split}
\mathrm{Var}(\theta) &= \mathbb{E}[\theta^2] - \mathbb{E}[\theta]^2 \\
&= \frac{(\alpha+1)\,\alpha}{(\alpha+\beta+1)(\alpha+\beta)} - \frac{\alpha^2}{(\alpha+\beta)^2} \\
&= \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}
\end{split}
$$

Then the posterior expected value and posterior variance follow directly, because the conjugate update gives $\alpha_i \rightarrow \alpha_i + m_i$ and $\beta_i \rightarrow \beta_i + n_i - m_i$:

$$
\mathbb{E}(\hat{\theta}_{i}) = \frac{m_i + \alpha_i}{n_i + \alpha_i + \beta_i}; \qquad \mathrm{Var}(\hat{\theta}_{i}) = \frac{(m_i + \alpha_i)(n_i - m_i + \beta_i)}{(n_i + \alpha_i + \beta_i + 1)(n_i + \alpha_i + \beta_i)^2}
$$
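These posterior moments are easy to confirm with scipy's beta distribution; the sketch below reuses the same toy counts and prior as above (not real data):

from scipy import stats

m, n, a0, b0 = 7, 50, 2.0, 30.0          # toy counts and prior
a1, b1 = m + a0, n - m + b0              # conjugate update
mean, var = stats.beta.stats(a1, b1, moments='mv')
print(mean, (m + a0) / (n + a0 + b0))
print(var, (m + a0) * (n - m + b0) / ((n + a0 + b0 + 1) * (n + a0 + b0) ** 2))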

Empirical Bayesian Inference

Estimating a Prior from All Your Data

Now we can empirically estimate $\alpha_0$ and $\beta_0$ for the prior beta distribution from all data points, using MLE with the beta-binomial likelihood function:

$$
\begin{split}
\mathcal{L}(\alpha, \beta) &= -\log \ell(\alpha, \beta) \\
&= -\sum_{i=1}^{I} \log \left(\frac{\Gamma(m_{i}+\alpha)\,\Gamma(n_{i}-m_{i}+\beta)}{\Gamma(n_{i}+\alpha+\beta)} \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\right) \quad \text{(dropping terms constant w.r.t. $\alpha$ and $\beta$)}
\end{split}
$$
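As a minimal sketch of this step (on simulated counts, since no real data are attached here), the prior can be estimated by handing this negative log-likelihood to scipy.optimize.minimize; the full version, including the covariate-adjusted reparameterization, is in the Example Code at the end:

import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = rng.integers(20, 200, size=500)          # simulated coverage per site
theta = rng.beta(2.0, 30.0, size=500)        # "true" prior used only to simulate
m = rng.binomial(n, theta)

def nll(p):
    # pooled beta-binomial negative log-likelihood with params (α, β)
    return -np.sum(stats.betabinom.logpmf(m, n, p[0], p[1]))

res = minimize(nll, x0=[1.0, 1.0], method='L-BFGS-B', bounds=[(1e-3, None)] * 2)
print(res.x)    # should land near the simulating values (2, 30)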

Through the reparameterization, we can adjust for the total number of reads at each mismatch site. Likewise, we can correct for other confounding factors such as batch, sex, and age. To achieve this, we fit a regression:

$$
\begin{split}
\theta_i &\sim \mathrm{Beta}(\mu_{i}, \rho_{i}) \\
\mu_{i} &= \sigma^{-1}(x_{i}^{T}\omega + \epsilon)
\end{split}
$$

where $\sigma$ is the logit link function (so $\sigma^{-1}$ is the inverse logit), $x_i$ is the row of the design matrix $X$ of confounding factors for site $i$, $\omega$ denotes the coefficients for the covariates in $X$, and $\epsilon$ is the intercept. Potentially, the design matrix $X$ could be:

$$
X \sim \log(n) + \mathrm{batch} + \mathrm{group} + \mathrm{age} + \dots
$$
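One way to assemble such a design matrix is with pandas; the column names below (batch, group, age) are hypothetical covariates used only for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({'n': [50, 80, 120],
                   'batch': ['b1', 'b2', 'b1'],
                   'group': ['ctrl', 'case', 'case'],
                   'age': [34, 55, 41]})
# log coverage, dummy-coded categorical covariates, and numeric covariates
X = pd.concat([np.log(df[['n']]).rename(columns={'n': 'log_n'}),
               pd.get_dummies(df[['batch', 'group']], drop_first=True),
               df[['age']]], axis=1)
print(X)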

Likewise, we can get the negative log likelihood with respect to $\mu$ and $\rho$:

$$
\mathcal{L}(\mu, \rho) = -\log \ell(\mu,\rho) = -\sum_{i=1}^{I} \log P(m_{i}|\mu_{i},\rho_{i})
$$

Updating Each Data Point

Just like the derivation shown above, we can update $\theta_i$ by calculating the expectation of its posterior distribution, the Expected A Posteriori (EAP) estimate, as the solution is closed-form:

$$
\begin{split}
\alpha_{i} &= m_{i} + \alpha_{0} \\
\beta_{i} &= n_{i} - m_{i} + \beta_{0} \\
\hat{\theta}_{i}^{\mathrm{EAP}} &= \frac{m_{i} + \alpha_{0}}{n_{i} + \alpha_{0} + \beta_{0}}
\end{split}
$$

Example Code

#!/usr/bin/env python

import numpy as np
from scipy import stats
from scipy.optimize import minimize
from scipy.special import expit, logit
import pandas as pd
import matplotlib.pyplot as plt

class BetaBinomEmpiricalBayes:
    def __init__(self, optim_method='L-BFGS-B'):
        self.optim_method = optim_method

    def reparam_1(self, α, β):
        μ = α / (α + β)
        ρ = 1 / (α + β + 1)
        return μ, ρ

    def reparam_2(self, μ, ρ):
        α = μ * (1 - ρ) / ρ
        β = (1 - μ) * (1 - ρ) / ρ
        return α, β

    def neglog_likelihood(self, params, *args):
        '''
        Negative log likelihood for the beta-binomial pmf,
        reparameterized with μ = σ^{-1}(Xω + ε)
        - params: parameters to be estimated (ρ, ε, ω)
        - args: tuple of arrays (m, n, x)
        '''
        ρ = params[0]
        ε = params[1]
        ω = params[2]
        
        m = args[0]
        n = args[1]
        x = args[2]

        μ = expit(np.dot(x, ω) + ε)
        α, β = self.reparam_2(μ, ρ)
        logpdf = stats.betabinom.logpmf(k=m, n=n, a=α, b=β, loc=0)
        nll = -np.sum(logpdf)
        return nll

    def neglog_likelihood_ab(self, params, *args):
        '''
        Negative log likelihood for the beta-binomial pmf in terms of α, β
        - params: parameters to be estimated (α, β)
        - args: tuple of arrays (m, n)
        '''
        a = params[0]
        b = params[1]
        
        m = args[0]
        n = args[1]

        logpdf = stats.betabinom.logpmf(k=m, n=n, a=a, b=b, loc=0)
        nll = -np.sum(logpdf)
        return nll

    def fit_betabinom(self, params_init, data, bounds, ab_mode=False):
        '''Maximum likelihood estimation'''
        # choose the likelihood matching the parameterization:
        # (α, β) when ab_mode is True, (ρ, ε, ω) otherwise
        nll = self.neglog_likelihood_ab if ab_mode else self.neglog_likelihood
        res = minimize(nll,
                       x0 = params_init,
                       args = data,
                       method = self.optim_method,
                       bounds = bounds)
        return res

    def post_reparam(self, ρ, ω, x, ε):
        '''
        convert μ, ρ (post MLE) back to α, β
        '''
        μ = expit(np.dot(x, ω) + ε)
        α, β = self.reparam_2(μ, ρ)
        return α, β

    def update_alpha(self, m, α):
        return m + α

    def update_beta(self, m, n, β):
        return n - m + β

    def update_theta(self, α_1, β_1):
        '''
        use the estimated params to update each data point's
        posterior θ (expectation and variance)
        α_1, β_1 are arrays
        '''
        θ_expected = α_1 / (α_1 + β_1)
        θ_var = (α_1 * β_1) / ((α_1 + β_1 + 1) * (α_1 + β_1)**2)
        return θ_expected, θ_var

    def credible_interval(self, alpha, beta):
        '''95% interval'''
        low = stats.beta.ppf(0.025, alpha, beta, loc=0, scale=1)
        high = stats.beta.ppf(0.975, alpha, beta, loc=0, scale=1)
        return low, high

    def posterior_err_prob(self, alpha, beta, cutoff):
        '''posterior error probability: beta CDF evaluated at the cutoff'''
        try:
            pep = stats.beta.cdf(cutoff, alpha, beta, loc=0, scale=1)
        except Exception:
            pep = float('inf')
        return pep

    def q_value(self, pep):
        '''cumulative mean of all sorted posterior error probabilities'''
        s = pd.Series(pep).sort_values()
        # cumulative mean (not just cumulative sum) of the sorted PEPs
        return s.cumsum() / np.arange(1, len(s) + 1)

    def plot_beta_dist(self, a, b, save_to=None):
        '''
        code is copied from 
        https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.beta.html
        '''
        mean, var, skew, kurt = stats.beta.stats(a, b, moments='mvsk')
        print(f'mean={mean}, var={var}, skew={skew}, kurt={kurt}')

        fig, ax = plt.subplots(1, 1)
        x = np.linspace(stats.beta.ppf(0.01, a, b),
                        stats.beta.ppf(0.99, a, b), 100)
        ax.plot(x, stats.beta.pdf(x, a, b),
               'r-', lw=3.5, alpha=0.6, label='beta pdf')

        r = stats.beta.rvs(a, b, size=1000)
        ax.hist(r, density=True, bins=30, histtype='stepfilled', alpha=0.2)
        ax.legend(loc='best', frameon=False)
        plt.title(f'a={"%.4f" % a}, b={"%.4f" % b}')
        if save_to:
            plt.savefig(save_to)
        else:
            plt.show()

def run_bb_eBayes(table, ab_mode=False):
    # table is a pandas DataFrame with integer read-count columns 'm' and 'n'
    m = table['m'].to_numpy()
    n = table['n'].to_numpy()
    log_n = np.log(n)

    reg = BetaBinomEmpiricalBayes()
    if not ab_mode:
        # ρ, ε, ω
        params_init = np.array([0.5, 0.5, 0.8])
        # MLE
        res = reg.fit_betabinom(params_init=params_init, 
                                data=(m, n, log_n),
                                bounds=[(0.001, 0.999),  # ρ must lie in (0, 1)
                                        (None, None),
                                        (None, None)])
        # update each datapoint
        ρ, ε, ω = res.x
        params = f'ρ={"%.4f" % ρ},ε={"%.4f" % ε},ω={"%.4f" % ω}'
        α_0, β_0 = reg.post_reparam(ρ=ρ, ω=ω, x=log_n, ε=ε)
    else:
        # α, β
        params_init = np.array([1, 1])
        # MLE
        res = reg.fit_betabinom(params_init=params_init, 
                                data=(m, n),
                                bounds=[(0.001, None),
                                        (0.001, None)],
                                ab_mode=True)
        # update each datapoint
        α_0, β_0 = res.x
        params = f'a={"%.4f" % α_0},b={"%.4f" % β_0}'


    α_1 = reg.update_alpha(m, α_0)
    β_1 = reg.update_beta(m, n, β_0)
    θ_expected, θ_var = reg.update_theta(α_1, β_1)
    # assign numeric columns first so sorting is numeric, then format for display
    table['α_1'] = α_1
    table['β_1'] = β_1
    table['freq'] = m / n
    table['eb_freq'] = θ_expected
    table['eb_var'] = θ_var
    table = table.sort_values(by=['eb_freq', 'eb_var'],
                              ascending=[False, True])
    table['α_1'] = ["%.3f" % a for a in table['α_1']]
    table['β_1'] = ["%.3f" % b for b in table['β_1']]
    table['freq'] = ["%.4f" % f for f in table['freq']]
    table['eb_freq'] = ["%.4f" % e for e in table['eb_freq']]
    table['eb_var'] = ["{:.3e}".format(v) for v in table['eb_var']]

    return table, params
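A minimal usage example (with made-up counts; the input table only needs integer columns m and n):

if __name__ == '__main__':
    toy = pd.DataFrame({'m': [3, 0, 12, 1, 7],
                        'n': [40, 25, 300, 18, 90]})
    result, fitted_params = run_bb_eBayes(toy, ab_mode=True)
    print(fitted_params)
    print(result)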