Beta-binomial empirical Bayesian modeling

Jan 9, 2023

Beta-binomial Model

Here I am using the simplest beta-binomial model. Suppose we observed $n_i$ reads covering a mismatch position $i$, of which $m_i$ reads harbor an allele that is considered mutated. The mutation rate $\theta_i$ follows a beta distribution.

$$
\begin{split}
m_{i} &\sim \mathrm{Binom}(n_{i}, \theta_{i}) \\
\theta_{i} &\sim \mathrm{Beta}(\alpha_{i}, \beta_{i})
\end{split}
$$

where $\alpha_i$ and $\beta_i$ are hyperparameters, and $\mathrm{B}(\cdot)$ denotes the beta function.

We know that

$$
\theta \sim \mathrm{Beta}(\alpha, \beta) = \mathrm{Beta}(\theta|\alpha, \beta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\mathrm{B}(\alpha, \beta)}
$$

$$
\begin{split}
\mathrm{B}(\alpha, \beta) &= \int_0^1 t^{\alpha-1} (1-t)^{\beta-1}\, dt \quad \text{(integral definition)} \\
\mathrm{B}(\alpha, \beta) &= \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)} \quad \text{(relationship to the gamma function)}
\end{split}
$$

We can marginalize out the latent variable $\theta_i$, and the probability of having $m_i$ given $n_i$, $\alpha_i$, $\beta_i$ becomes:

$$
\begin{split}
P(m_i|n_i,\alpha_i,\beta_i) &= \int_{0}^{1} \mathrm{Binom}(m_i|n_i,\theta_i) \cdot \mathrm{Beta}(\theta_i|\alpha_i,\beta_i)\, d\theta_i \\
&= \int_0^1 \binom{n_i}{m_i}\theta_{i}^{m_i}(1-\theta_i)^{n_i-m_i} \cdot \frac{1}{\mathrm{B}(\alpha_i, \beta_i)}\, \theta_i^{\alpha_{i}-1}(1-\theta_i)^{\beta_{i}-1}\, d\theta_i \\
&= \binom{n_i}{m_i} \frac{1}{\mathrm{B}(\alpha_{i}, \beta_{i})} \int_0^1 \theta_{i}^{m_i+\alpha_{i}-1}(1-\theta_i)^{n_i-m_i+\beta_{i}-1}\, d\theta_i \\
&= \binom{n_i}{m_i} \frac{\mathrm{B}(m_i+\alpha_{i},\, n_i-m_i+\beta_{i})}{\mathrm{B}(\alpha_{i}, \beta_{i})}
\end{split}
$$
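As a quick sanity check, the closed-form marginal can be evaluated with the log-beta function and compared against scipy.stats.betabinom. This is a minimal sketch with made-up numbers ($n=50$, $m=7$, $\alpha=2$, $\beta=30$), not part of the model itself:

import numpy as np
from scipy import stats
from scipy.special import betaln, comb

# toy values for n, m, α, β (made up, just to check the algebra)
n, m, a, b = 50, 7, 2.0, 30.0
manual = comb(n, m) * np.exp(betaln(m + a, n - m + b) - betaln(a, b))
print(manual, stats.betabinom.pmf(m, n, a, b))  # the two numbers agree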

Reparameterization

We can also reparameterize $\alpha$ and $\beta$ to $\mu$ and $\rho$:

$$
\begin{split}
\mu &= \frac{\alpha}{\alpha+\beta}, \qquad \rho = \frac{1}{\alpha+\beta+1} \\
\alpha &= \frac{\mu(1-\rho)}{\rho}, \qquad \beta = \frac{(1-\mu)(1-\rho)}{\rho}
\end{split}
$$

where $\mu$ corresponds to the estimate of $\theta$, and $\rho$ corresponds to the extent of overdispersion.

The expectation and variance of the beta-binomial distribution are:

$$
\mathbb{E}(m|n,\mu,\rho) = n\mu; \qquad \mathrm{Var}(m|n,\mu,\rho) = n\mu(1-\mu)\left[1+(n-1)\rho\right]
$$
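The two moment formulas can be checked numerically; below is a small sketch with arbitrary values of $\mu$, $\rho$, and $n$ that compares them to the moments scipy reports for the corresponding beta-binomial:

from scipy import stats

mu, rho, n = 0.01, 0.05, 200          # arbitrary illustrative values
a = mu * (1 - rho) / rho              # map (μ, ρ) back to (α, β)
b = (1 - mu) * (1 - rho) / rho
mean, var = stats.betabinom.stats(n, a, b, moments='mv')
print(mean, n * mu)                                   # both equal nμ
print(var, n * mu * (1 - mu) * (1 + (n - 1) * rho))   # both equal nμ(1-μ)[1+(n-1)ρ]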

Maximum Likelihood Estimation (Frequentist)

We can estimate $\alpha_i$ and $\beta_i$ by Maximum Likelihood Estimation (MLE). The likelihood function is:

$$
\ell(\alpha_{i},\beta_{i}) = \prod_{i=1}^{I} P(m_{i}|\alpha_{i},\beta_{i}) = \prod_{i=1}^{I} \binom{n_{i}}{m_{i}} \frac{\mathrm{B}(m_{i}+\alpha_{i},\, n_{i}-m_{i}+\beta_{i})}{\mathrm{B}(\alpha_{i}, \beta_{i})}
$$

After reparameterization:

$$
\ell(\mu_{i},\rho_{i}) = \prod_{i=1}^{I} P(m_{i}|\mu_{i},\rho_{i}) = \prod_{i=1}^{I} \binom{n_{i}}{m_{i}} \frac{\mathrm{B}\!\left(m_{i}+\frac{\mu_{i}(1-\rho_{i})}{\rho_{i}},\; n_{i}-m_{i}+\frac{(1-\mu_{i})(1-\rho_{i})}{\rho_{i}}\right)}{\mathrm{B}\!\left(\frac{\mu_{i}(1-\rho_{i})}{\rho_{i}},\; \frac{(1-\mu_{i})(1-\rho_{i})}{\rho_{i}}\right)}
$$

We can get the point estimate by taking the derivative of the (binomial) likelihood with respect to $\theta_i$ and setting it to 0:

$$
\begin{split}
\nabla_{\theta_{i}} P(m_{i}|\theta_{i}) &= \frac{d}{d\theta_{i}}\left[\binom{n_{i}}{m_{i}}\, \theta_{i}^{m_{i}}(1-\theta_{i})^{n_{i}-m_{i}}\right] \\
&\propto m_{i}\theta_{i}^{m_{i}-1}(1-\theta_{i})^{n_{i}-m_{i}} + \theta_{i}^{m_{i}}(m_{i}-n_{i})(1-\theta_{i})^{n_{i}-m_{i}-1} = 0 \\
\hat{\theta}_{i}^{\mathrm{MLE}} &= \frac{m_{i}}{n_{i}}
\end{split}
$$
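As a small illustration (with toy counts, not real data), numerically maximizing the binomial log-likelihood lands on the same $m_i/n_i$ estimate:

from scipy import stats
from scipy.optimize import minimize_scalar

m, n = 7, 50   # toy counts
res = minimize_scalar(lambda t: -stats.binom.logpmf(m, n, t),
                      bounds=(1e-6, 1 - 1e-6), method='bounded')
print(res.x, m / n)   # both ≈ 0.14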

Bayesian Estimation

First, derive the posterior distribution for $\theta_i$:

$$
\begin{split}
P(\theta_i|m_i) &= \frac{P(m_i|\theta_i)\,P(\theta_i)}{P(m_i)} = \frac{P(m_i|\theta_i)\,P(\theta_i)}{\int_{0}^{1} P(m_i|\theta_i)\,P(\theta_i)\, d\theta_i} \\
&= \frac{\binom{n_i}{m_i}\theta_{i}^{m_{i}+\alpha_{i}-1} (1-\theta_i)^{n_{i}-m_{i}+\beta_{i}-1}\big/\mathrm{B}(\alpha_{i},\beta_{i})}{\int_0^1 \binom{n_{i}}{m_{i}}\theta_{i}^{m_{i}+\alpha_{i}-1}(1-\theta_{i})^{n_{i}-m_{i}+\beta_{i}-1}\, d\theta_{i}\big/\mathrm{B}(\alpha_{i},\beta_{i})} \\
&= \frac{\theta_{i}^{m_{i}+\alpha_{i}-1} (1-\theta_{i})^{n_{i}-m_{i}+\beta_{i}-1}}{\mathrm{B}(m_{i}+\alpha_{i},\, n_{i}-m_{i}+\beta_{i})}
\end{split}
$$

where $P(m_i)$ is the marginal distribution of $m_i$, i.e. the joint density $P(\theta_i, m_i)$ with $\theta_i$ integrated out. Since this normalizing constant does not depend on $\theta_i$ (and the binomial coefficients cancel), the posterior can also be written as $P(\theta_i|m_i) \propto P(\theta_i)\, P(m_i|\theta_i)$, which is a $\mathrm{Beta}(m_i+\alpha_i,\, n_i-m_i+\beta_i)$ density.

To calculate the Maximum A Posteriori (MAP) point estimator, take the derivative of the posterior density and set it to 0:

$$
\begin{split}
\nabla_{\theta_{i}} P(\theta_{i}|m_{i}) &= \frac{d}{d\theta_{i}}\left[\frac{\theta_{i}^{m_i+\alpha_i-1}(1-\theta_i)^{n_i-m_i+\beta_i-1}}{\mathrm{B}(m_i+\alpha_i,\, n_i-m_i+\beta_i)}\right] \\
&\propto \theta_{i}^{m_i+\alpha_i-1}(1-\theta_i)^{n_i-m_i+\beta_i-1} \left(\frac{m_i+\alpha_i-1}{\theta_{i}} - \frac{n_i-m_i+\beta_i-1}{1-\theta_i}\right) = 0 \\
\hat{\theta}_{i}^{\mathrm{MAP}} &= \frac{m_i + \alpha_i - 1}{n_i + \alpha_i + \beta_i - 2}
\end{split}
$$
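For a concrete (made-up) example, the closed-form MAP above matches the numerical mode of the posterior $\mathrm{Beta}(m_i+\alpha_i,\, n_i-m_i+\beta_i)$ density:

import numpy as np
from scipy import stats

m, n, a0, b0 = 7, 50, 2.0, 30.0          # toy counts and prior
a1, b1 = m + a0, n - m + b0              # posterior parameters
theta_map = (m + a0 - 1) / (n + a0 + b0 - 2)
grid = np.linspace(1e-4, 1 - 1e-4, 100001)
theta_mode = grid[np.argmax(stats.beta.pdf(grid, a1, b1))]
print(theta_map, theta_mode)             # both ≈ 0.1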

For Bayesian inference, we can get the following from the posterior distribution:

$$
\theta_{i} \mid m_i \sim \mathrm{Beta}(m_i + \alpha_i,\, n_i - m_i + \beta_i) = \mathrm{Beta}(\theta_{i}|m_i + \alpha_i,\, n_i - m_i + \beta_i)
$$

Note that the expected value of a beta random variable is:

$$
\begin{split}
\mathbb{E}[\theta] &= \int_{0}^{1} \theta\, f_{\theta}(\theta)\, d\theta = \int_0^1 \theta\, \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\mathrm{B}(\alpha,\beta)}\, d\theta \\
&= \frac{1}{\mathrm{B}(\alpha,\beta)} \int_0^1 \theta^{\alpha}(1-\theta)^{\beta-1}\, d\theta \\
&= \frac{\mathrm{B}(\alpha+1,\beta)}{\mathrm{B}(\alpha, \beta)} \quad \text{(rewrite the beta functions in terms of gamma functions)} \\
&= \frac{\Gamma(\alpha+1)\,\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\alpha+\beta+1)} \quad \text{(using $\Gamma(z+1) = z\,\Gamma(z)$)} \\
&= \frac{\alpha}{\alpha+\beta}
\end{split}
$$

The variance is:

$$
\begin{split}
\mathrm{Var}(\theta) &= \mathbb{E}[\theta^2] - \mathbb{E}[\theta]^2 \\
&= \frac{(\alpha+1)\,\alpha}{(\alpha+\beta+1)(\alpha+\beta)} - \frac{\alpha^2}{(\alpha+\beta)^2} \\
&= \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}
\end{split}
$$

Then the posterior expected value and posterior variance follow directly, because the conjugate update gives $\alpha_i \rightarrow \alpha_i + m_i$ and $\beta_i \rightarrow \beta_i + n_i - m_i$:

$$
\mathbb{E}(\hat{\theta}_{i}) = \frac{m_i + \alpha_i}{n_i + \alpha_i + \beta_i}; \qquad \mathrm{Var}(\hat{\theta}_{i}) = \frac{(m_i + \alpha_i)(n_i - m_i + \beta_i)}{(n_i + \alpha_i + \beta_i + 1)(n_i + \alpha_i + \beta_i)^2}
$$
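These posterior moments are easy to confirm with scipy's beta distribution; the sketch below reuses the same toy counts and prior as above (not real data):

from scipy import stats

m, n, a0, b0 = 7, 50, 2.0, 30.0          # toy counts and prior
a1, b1 = m + a0, n - m + b0              # conjugate update
mean, var = stats.beta.stats(a1, b1, moments='mv')
print(mean, (m + a0) / (n + a0 + b0))
print(var, (m + a0) * (n - m + b0) / ((n + a0 + b0 + 1) * (n + a0 + b0) ** 2))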

Empirical Bayesian Inference

Estimating a Prior from All Your Data

Now we can empirically estimate $\alpha_0$ and $\beta_0$ for the prior beta distribution from all data points, using MLE with the beta-binomial likelihood function:

$$
\begin{split}
\mathcal{L}(\alpha, \beta) &= -\log \ell(\alpha, \beta) \\
&= -\sum_{i=1}^{I} \log \left(\frac{\Gamma(m_{i}+\alpha)\,\Gamma(n_{i}-m_{i}+\beta)}{\Gamma(n_{i}+\alpha+\beta)} \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\right) \quad \text{(dropping terms constant w.r.t. $\alpha$ and $\beta$)}
\end{split}
$$
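As a minimal sketch of this step (on simulated counts, since no real data are attached here), the prior can be estimated by handing this negative log-likelihood to scipy.optimize.minimize; the full version, including the covariate-adjusted reparameterization, is in the Example Code at the end:

import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = rng.integers(20, 200, size=500)          # simulated coverage per site
theta = rng.beta(2.0, 30.0, size=500)        # "true" prior used only to simulate
m = rng.binomial(n, theta)

def nll(p):
    # pooled beta-binomial negative log-likelihood with params (α, β)
    return -np.sum(stats.betabinom.logpmf(m, n, p[0], p[1]))

res = minimize(nll, x0=[1.0, 1.0], method='L-BFGS-B', bounds=[(1e-3, None)] * 2)
print(res.x)    # should land near the simulating values (2, 30)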

Through the reparameterization, we can adjust for the total number of reads at each mismatch site. Likewise, we can correct for other confounding factors such as batch, sex, and age. To achieve this, we fit a regression:

$$
\begin{split}
\theta_i &\sim \mathrm{Beta}(\mu_{i}, \rho_{i}) \\
\mu_{i} &= \sigma^{-1}(x_{i}^{T}\omega + \epsilon)
\end{split}
$$

where $\sigma$ is the logit link function (so $\sigma^{-1}$ is the inverse logit), $x_i$ is the row of the design matrix $X$ of confounding factors for site $i$, $\omega$ denotes the coefficients for the covariates in $X$, and $\epsilon$ is the intercept. Potentially, the design matrix $X$ could be:

$$
X \sim \log(n) + \mathrm{batch} + \mathrm{group} + \mathrm{age} + \dots
$$
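One way to assemble such a design matrix is with pandas; the column names below (batch, group, age) are hypothetical covariates used only for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({'n': [50, 80, 120],
                   'batch': ['b1', 'b2', 'b1'],
                   'group': ['ctrl', 'case', 'case'],
                   'age': [34, 55, 41]})
# log coverage, dummy-coded categorical covariates, and numeric covariates
X = pd.concat([np.log(df[['n']]).rename(columns={'n': 'log_n'}),
               pd.get_dummies(df[['batch', 'group']], drop_first=True),
               df[['age']]], axis=1)
print(X)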

Likewise, we can get the negative log likelihood with respect to $\mu$ and $\rho$:

$$
\mathcal{L}(\mu, \rho) = -\log \ell(\mu,\rho) = -\sum_{i=1}^{I} \log P(m_{i}|\mu_{i},\rho_{i})
$$

Updating Each Data Point

Just like the derivation shown above, we can update $\theta_i$ by calculating the expectation of its posterior distribution, the Expected A Posteriori (EAP) estimate, as the solution is closed-form:

$$
\begin{split}
\alpha_{i} &= m_{i} + \alpha_{0} \\
\beta_{i} &= n_{i} - m_{i} + \beta_{0} \\
\hat{\theta}_{i}^{\mathrm{EAP}} &= \frac{m_{i} + \alpha_{0}}{n_{i} + \alpha_{0} + \beta_{0}}
\end{split}
$$

Example Code

#!/usr/bin/env python

import numpy as np
from scipy import stats
from scipy.optimize import minimize
from scipy.special import expit, logit
import pandas as pd
import matplotlib.pyplot as plt

class BetaBinomEmpiricalBayes:
    def __init__(self, optim_method='L-BFGS-B'):
        self.optim_method = optim_method

    def reparam_1(self, α, β):
        μ = α / (α + β)
        ρ = 1 / (α + β + 1)
        return μ, ρ

    def reparam_2(self, μ, ρ):
        α = μ * (1 - ρ) / ρ
        β = (1 - μ) * (1 - ρ) / ρ
        return α, β

    def neglog_likelihood(self, params, *args):
        '''
        Negative log likelihood for the beta-binomial pmf,
        reparameterized with μ = σ^{-1}(Xω + ε)
        - params: parameters to be estimated (ρ, ε, ω)
        - args: tuple of arrays (m, n, x)
        '''
        ρ = params[0]
        ε = params[1]
        ω = params[2]
        
        m = args[0]
        n = args[1]
        x = args[2]

        μ = expit(np.dot(x, ω) + ε)
        α, β = self.reparam_2(μ, ρ)
        logpdf = stats.betabinom.logpmf(k=m, n=n, a=α, b=β, loc=0)
        nll = -np.sum(logpdf)
        return nll

    def neglog_likelihood_ab(self, params, *args):
        '''
        Negative log likelihood for the beta-binomial pmf in terms of α, β
        - params: parameters to be estimated (α, β)
        - args: tuple of arrays (m, n)
        '''
        a = params[0]
        b = params[1]
        
        m = args[0]
        n = args[1]

        logpdf = stats.betabinom.logpmf(k=m, n=n, a=a, b=b, loc=0)
        nll = -np.sum(logpdf)
        return nll

    def fit_betabinom(self, params_init, data, bounds, ab_mode=False):
        '''Maximum likelihood estimation'''
        # choose the likelihood matching the parameterization:
        # (α, β) when ab_mode is True, (ρ, ε, ω) otherwise
        nll = self.neglog_likelihood_ab if ab_mode else self.neglog_likelihood
        res = minimize(nll,
                       x0 = params_init,
                       args = data,
                       method = self.optim_method,
                       bounds = bounds)
        return res

    def post_reparam(self, ρ, ω, x, ε):
        '''
        convert μ, ρ (post MLE) back to α, β
        '''
        μ = expit(np.dot(x, ω) + ε)
        α, β = self.reparam_2(μ, ρ)
        return α, β

    def update_alpha(self, m, α):
        return m + α

    def update_beta(self, m, n, β):
        return n - m + β

    def update_theta(self, α_1, β_1):
        '''
        use the estimated params to update each data point's
        posterior θ (expectation and variance)
        α_1, β_1 are arrays
        '''
        θ_expected = α_1 / (α_1 + β_1)
        θ_var = (α_1 * β_1) / ((α_1 + β_1 + 1) * (α_1 + β_1)**2)
        return θ_expected, θ_var

    def credible_interval(self, alpha, beta):
        '''95% interval'''
        low = stats.beta.ppf(0.025, alpha, beta, loc=0, scale=1)
        high = stats.beta.ppf(0.975, alpha, beta, loc=0, scale=1)
        return low, high

    def posterior_err_prob(self, alpha, beta, cutoff):
        '''posterior error probability: beta CDF evaluated at the cutoff'''
        try:
            pep = stats.beta.cdf(cutoff, alpha, beta, loc=0, scale=1)
        except Exception:
            pep = float('inf')
        return pep

    def q_value(self, pep):
        '''cumulative mean of all sorted posterior error probabilities'''
        s = pd.Series(pep).sort_values()
        # cumulative mean (not just cumulative sum) of the sorted PEPs
        return s.cumsum() / np.arange(1, len(s) + 1)

    def plot_beta_dist(self, a, b, save_to=None):
        '''
        code is copied from 
        https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.beta.html
        '''
        mean, var, skew, kurt = stats.beta.stats(a, b, moments='mvsk')
        print(f'mean={mean}, var={var}, skew={skew}, kurt={kurt}')

        fig, ax = plt.subplots(1, 1)
        x = np.linspace(stats.beta.ppf(0.01, a, b),
                        stats.beta.ppf(0.99, a, b), 100)
        ax.plot(x, stats.beta.pdf(x, a, b),
               'r-', lw=3.5, alpha=0.6, label='beta pdf')

        r = stats.beta.rvs(a, b, size=1000)
        ax.hist(r, density=True, bins=30, histtype='stepfilled', alpha=0.2)
        ax.legend(loc='best', frameon=False)
        plt.title(f'a={"%.4f" % a}, b={"%.4f" % b}')
        if save_to:
            plt.savefig(save_to)
        else:
            plt.show()

def run_bb_eBayes(table, ab_mode=False):
    # table is a pandas DataFrame with integer read-count columns 'm' and 'n'
    m = table['m'].to_numpy()
    n = table['n'].to_numpy()
    log_n = np.log(n)

    reg = BetaBinomEmpiricalBayes()
    if not ab_mode:
        # ρ, ε, ω
        params_init = np.array([0.5, 0.5, 0.8])
        # MLE
        res = reg.fit_betabinom(params_init=params_init, 
                                data=(m, n, log_n),
                                bounds=[(0.001, 0.999),  # ρ must lie in (0, 1)
                                        (None, None),
                                        (None, None)])
        # update each datapoint
        ρ, ε, ω = res.x
        params = f'ρ={"%.4f" % ρ},ε={"%.4f" % ε},ω={"%.4f" % ω}'
        α_0, β_0 = reg.post_reparam(ρ=ρ, ω=ω, x=log_n, ε=ε)
    else:
        # α, β
        params_init = np.array([1, 1])
        # MLE
        res = reg.fit_betabinom(params_init=params_init, 
                                data=(m, n),
                                bounds=[(0.001, None),
                                        (0.001, None)],
                                ab_mode=True)
        # update each datapoint
        α_0, β_0 = res.x
        params = f'a={"%.4f" % α_0},b={"%.4f" % β_0}'


    α_1 = reg.update_alpha(m, α_0)
    β_1 = reg.update_beta(m, n, β_0)
    θ_expected, θ_var = reg.update_theta(α_1, β_1)
    # assign numeric columns first so sorting is numeric, then format for display
    table['α_1'] = α_1
    table['β_1'] = β_1
    table['freq'] = m / n
    table['eb_freq'] = θ_expected
    table['eb_var'] = θ_var
    table = table.sort_values(by=['eb_freq', 'eb_var'],
                              ascending=[False, True])
    table['α_1'] = ["%.3f" % a for a in table['α_1']]
    table['β_1'] = ["%.3f" % b for b in table['β_1']]
    table['freq'] = ["%.4f" % f for f in table['freq']]
    table['eb_freq'] = ["%.4f" % e for e in table['eb_freq']]
    table['eb_var'] = ["{:.3e}".format(v) for v in table['eb_var']]

    return table, params
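A minimal usage example (with made-up counts; the input table only needs integer columns m and n):

if __name__ == '__main__':
    toy = pd.DataFrame({'m': [3, 0, 12, 1, 7],
                        'n': [40, 25, 300, 18, 90]})
    result, fitted_params = run_bb_eBayes(toy, ab_mode=True)
    print(fitted_params)
    print(result)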