In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound[1] or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data.
The ELBO is useful because it guarantees a worst case for the log-likelihood of some distribution (e.g. $p(X)$) which models a set of data. The actual log-likelihood may be higher (indicating an even better fit to the distribution) because the ELBO includes a Kullback-Leibler divergence (KL divergence) term, which lowers the ELBO when an internal part of the model is inaccurate even though the model fits well overall. Thus an improvement in the ELBO indicates an improvement in the likelihood of the model $p(X)$, in the fit of a component internal to the model, or in both. This makes the ELBO a good training objective (its negative serves as a loss function), e.g., for training a deep neural network to improve both the model overall and the internal component. (The internal component is $q_\phi(\cdot|x)$, defined in detail later in this article.)
Definition
Let $X$ and $Z$ be random variables, jointly distributed with distribution $p_\theta$. For example, $p_\theta(X)$ is the marginal distribution of $X$, and $p_\theta(Z \mid X)$ is the conditional distribution of $Z$ given $X$. Then, for a sample $x$, and any distribution $q_\phi$, the ELBO is defined as
$$L(\phi, \theta; x) := \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right].$$
The ELBO can equivalently be written as[2]
$$\begin{aligned} L(\phi, \theta; x) &= \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln p_\theta(x, z)\right] + H[q_\phi(z|x)] \\ &= \ln p_\theta(x) - D_{KL}(q_\phi(z|x) \| p_\theta(z|x)). \end{aligned}$$
In the first line, $H[q_\phi(z|x)]$ is the entropy of $q_\phi$, which relates the ELBO to the Helmholtz free energy.[3] In the second line, $\ln p_\theta(x)$ is called the evidence for $x$, and $D_{KL}(q_\phi(z|x) \| p_\theta(z|x))$ is the Kullback-Leibler divergence between $q_\phi$ and $p_\theta$. Since the Kullback-Leibler divergence is non-negative, $L(\phi, \theta; x)$ forms a lower bound on the evidence (ELBO inequality)
$$\ln p_\theta(x) \geq \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right].$$
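The three expressions for the ELBO can be checked numerically on a model small enough to compute exactly. The following is a minimal sketch, assuming a hypothetical latent variable with two values and made-up probabilities (`prior`, `likelihood` and `q` are illustrative numbers, not taken from any source):

```python
import math

# Hypothetical toy model: latent z in {0, 1}, one fixed observation x.
prior = [0.6, 0.4]          # p(z)
likelihood = [0.2, 0.7]     # p_theta(x | z), evaluated at the observed x
q = [0.5, 0.5]              # q_phi(z | x), an arbitrary approximate posterior

joint = [pz * px for pz, px in zip(prior, likelihood)]   # p_theta(x, z)
evidence = sum(joint)                                    # p_theta(x)
posterior = [j / evidence for j in joint]                # p_theta(z | x)

def elbo_definition(q, joint):
    """E_{z ~ q}[ln p_theta(x, z) / q_phi(z | x)]."""
    return sum(qz * math.log(j / qz) for qz, j in zip(q, joint))

def entropy(q):
    """H[q] = -sum_z q(z) ln q(z)."""
    return -sum(qz * math.log(qz) for qz in q)

def kl(a, b):
    """D_KL(a || b) for discrete distributions with full support."""
    return sum(az * math.log(az / bz) for az, bz in zip(a, b))

elbo = elbo_definition(q, joint)
elbo_entropy_form = sum(qz * math.log(j) for qz, j in zip(q, joint)) + entropy(q)
elbo_kl_form = math.log(evidence) - kl(q, posterior)
```

All three values agree up to floating-point error, and `elbo` stays below `math.log(evidence)`. Replacing `q` by `posterior` makes the KL term vanish, so the bound becomes tight.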
Motivation
Variational Bayesian inference
Suppose we have an observable random variable $X$, and we want to find its true distribution $p^*$. This would allow us to generate data by sampling, and estimate probabilities of future events. In general, it is impossible to find $p^*$ exactly, forcing us to search for a good approximation.
That is, we define a sufficiently large parametric family $\{p_\theta\}_{\theta \in \Theta}$ of distributions, then solve for $\min_\theta L(p_\theta, p^*)$ for some loss function $L$. One possible way to solve this is by considering small variation from $p_\theta$ to $p_{\theta + \delta\theta}$, and solve for $L(p_\theta, p^*) - L(p_{\theta + \delta\theta}, p^*) = 0$. This is a problem in the calculus of variations, thus it is called the variational method.
Since there are not many explicitly parametrized distribution families (all the classical distribution families, such as the normal distribution, the Gumbel distribution, etc., are far too simplistic to model the true distribution), we consider implicitly parametrized probability distributions:
- First, define a simple distribution $p(z)$ over a latent random variable $Z$. Usually a normal distribution or a uniform distribution suffices.
- Next, define a family of complicated functions $f_\theta$ (such as a deep neural network) parametrized by $\theta$.
- Finally, define a way to convert any $f_\theta(z)$ into a distribution (in general simple too, but unrelated to $p(z)$) over the observable random variable $X$. For example, let $f_\theta(z) = (f_1(z), f_2(z))$ have two outputs, then we can define the corresponding distribution over $X$ to be the normal distribution $\mathcal{N}(f_1(z), e^{f_2(z)})$.
This defines a family of joint distributions $p_\theta$ over $(X, Z)$. It is very easy to sample $(x, z) \sim p_\theta$: simply sample $z \sim p$, then compute $f_\theta(z)$, and finally sample $x \sim p_\theta(\cdot|z)$ using $f_\theta(z)$.
In other words, we have a generative model for both the observable and the latent.
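The sampling recipe can be sketched in a few lines. This is only an illustration under stated assumptions: `f_theta` is a hypothetical affine stand-in for a deep network, the prior is a standard normal, and $e^{f_2(z)}$ is interpreted as the variance (not the standard deviation) of the normal distribution over $X$.

```python
import math
import random

def f_theta(z, theta):
    """Hypothetical stand-in for a network with two outputs (f1, f2)."""
    a, b, c = theta
    return a * z + b, c          # f1: mean of x, f2: log-variance of x

def sample_joint(theta, rng):
    """Draw (x, z) from the joint distribution p_theta."""
    z = rng.gauss(0.0, 1.0)                       # z ~ p(z) = N(0, 1)
    f1, f2 = f_theta(z, theta)                    # compute f_theta(z)
    x = rng.gauss(f1, math.sqrt(math.exp(f2)))    # x ~ N(f1, e^{f2}), e^{f2} as variance
    return x, z

rng = random.Random(0)
theta = (2.0, 1.0, 0.0)
samples = [sample_joint(theta, rng) for _ in range(20000)]
mean_x = sum(x for x, _ in samples) / len(samples)
var_x = sum((x - mean_x) ** 2 for x, _ in samples) / len(samples)
```

For this affine choice the marginal of $X$ is itself normal, with mean `b` and variance `a**2 + exp(c)`, so `mean_x` and `var_x` should come out close to 1 and 5. With a real network in place of `f_theta`, sampling stays just as cheap while the marginal no longer has a closed form.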
Now, we consider a distribution $p_\theta$ good if it is a close approximation of $p^*$:
$$p_\theta(X) \approx p^*(X)$$
Since the distribution on the right side is over $X$ only, the distribution on the left side must marginalize the latent variable $Z$ away.
In general, it's impossible to perform the integral $p_\theta(x) = \int p_\theta(x|z) p(z) dz$, forcing us to perform another approximation.
Since $p_\theta(x) = \frac{p_\theta(x|z) p(z)}{p_\theta(z|x)}$ (Bayes' Rule), it suffices to find a good approximation of $p_\theta(z|x)$. So define another distribution family $q_\phi(z|x)$ and use it to approximate $p_\theta(z|x)$. This is a discriminative model for the latent.
The entire situation is summarized in the following table:
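| $X$: observable | $X, Z$ | $Z$: latent |
| --- | --- | --- |
| $p^*(x) \approx p_\theta(x) \approx \frac{p_\theta(x \mid z) p(z)}{q_\phi(z \mid x)}$, approximable | | $p(z)$, easy |
| | $p_\theta(x \mid z) p(z)$, easy | |
| $p_\theta(z \mid x) \approx q_\phi(z \mid x)$, approximable | | $p_\theta(x \mid z)$, easy |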
In Bayesian language, $X$ is the observed evidence, and $Z$ is the latent/unobserved. The distribution $p$ over $Z$ is the prior distribution over $Z$, $p_\theta(x|z)$ is the likelihood function, and $p_\theta(z|x)$ is the posterior distribution over $Z$.
Given an observation $x$, we can infer what $z$ likely gave rise to $x$ by computing $p_\theta(z|x)$. The usual Bayesian method is to estimate the integral $p_\theta(x) = \int p_\theta(x|z) p(z) dz$, then compute by Bayes' rule $p_\theta(z|x) = \frac{p_\theta(x|z) p(z)}{p_\theta(x)}$. This is expensive to perform in general, but if we can simply find a good approximation $q_\phi(z|x) \approx p_\theta(z|x)$ for most $x, z$, then we can infer $z$ from $x$ cheaply. Thus, the search for a good $q_\phi$ is also called amortized inference.
All in all, we have found a problem of variational Bayesian inference.
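To make the contrast between per-observation Bayesian inference and amortized inference concrete, here is a small sketch on a conjugate Gaussian model chosen purely for illustration: $p(z) = \mathcal{N}(0, 1)$ and $p_\theta(x|z) = \mathcal{N}(z, 1)$, for which the exact posterior is $\mathcal{N}(x/2, 1/2)$. The function names and the parametrization `phi` are assumptions made for this example.

```python
import math

def normal_pdf(v, mean, var):
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_by_bayes(x, grid):
    """Bayes' rule on a grid: p(z|x) = p(x|z) p(z) / p(x), with p(x) from a Riemann sum."""
    dz = grid[1] - grid[0]
    joint = [normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0) for z in grid]
    evidence = sum(joint) * dz
    return [j / evidence for j in joint], evidence

def q_phi(z, x, phi=(0.5, 0.5)):
    """Amortized approximation q_phi(z|x) = N(phi[0] * x, phi[1]); one phi serves every x."""
    return normal_pdf(z, phi[0] * x, phi[1])

grid = [-8.0 + 0.01 * i for i in range(1601)]
max_error = 0.0
for x in (-1.5, 0.0, 2.0):
    post, evidence = posterior_by_bayes(x, grid)          # integral redone for each x
    err = max(abs(p - q_phi(z, x)) for p, z in zip(post, grid))
    max_error = max(max_error, err)
```

The grid computation repeats the integral for every new observation, whereas `q_phi` is a single function of `x`. In this conjugate case `phi = (0.5, 0.5)` reproduces the posterior exactly; in general $q_\phi$ is only an approximation.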
Deriving the ELBO
A basic result in variational inference is that minimizing the Kullback–Leibler divergence (KL-divergence) is equivalent to maximizing the log-likelihood:
$$\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{KL}(p^*(x) \| p_\theta(x))$$
where $H(p^*) = -\mathbb{E}_{x \sim p^*}[\ln p^*(x)]$ is the entropy of the true distribution. So if we can maximize $\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)]$, we can minimize $D_{KL}(p^*(x) \| p_\theta(x))$, and consequently find an accurate approximation $p_\theta \approx p^*$.
To maximize $\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)]$, we simply sample many $x_i \sim p^*(x)$, i.e. use a Monte Carlo estimate of the expectation
$$N \max_\theta \mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] \approx \max_\theta \sum_i \ln p_\theta(x_i)$$
where $N$ is the number of samples drawn from the true distribution. This approximation can be seen as overfitting.[note 1]
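Both the identity and the sample-based approximation can be checked on a small discrete example. The distributions `p_star` and `p_theta` below are made-up numbers used only for illustration:

```python
import math
import random

p_star = [0.5, 0.3, 0.2]     # hypothetical true distribution over x in {0, 1, 2}
p_theta = [0.4, 0.4, 0.2]    # hypothetical model distribution

# Exact quantities appearing in the identity E[ln p_theta] = -H(p*) - KL(p* || p_theta).
expected_loglik = sum(p * math.log(m) for p, m in zip(p_star, p_theta))
entropy = -sum(p * math.log(p) for p in p_star)
kl = sum(p * math.log(p / m) for p, m in zip(p_star, p_theta))

# Sample-based estimate: average ln p_theta(x_i) over draws x_i ~ p*.
rng = random.Random(0)
N = 50000
data = rng.choices(range(3), weights=p_star, k=N)
mc_estimate = sum(math.log(p_theta[x]) for x in data) / N
```

Here `expected_loglik` equals `-entropy - kl` up to rounding, and `mc_estimate` approaches `expected_loglik` as `N` grows. Since `entropy` does not depend on the model, raising the expected log-likelihood can only come from lowering the KL term.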
In order to maximize $\sum_i \ln p_\theta(x_i)$, it's necessary to find $\ln p_\theta(x)$:
$$\ln p_\theta(x) = \ln \int p_\theta(x|z) p(z) dz$$
This usually has no closed form and must be estimated. The usual way to estimate integrals is Monte Carlo integration with importance sampling:
$$\int p_\theta(x|z) p(z) dz = \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right]$$
where $q_\phi(z|x)$ is a sampling distribution over $z$ that we use to perform the Monte Carlo integration.
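As a rough sketch of this importance-sampling estimate, consider again an illustrative conjugate model, $p(z) = \mathcal{N}(0, 1)$ and $p_\theta(x|z) = \mathcal{N}(z, 1)$, whose evidence $p_\theta(x) = \mathcal{N}(x; 0, 2)$ is known exactly. The sampling distribution (a normal with mean `q_mean` and variance `q_var`) is an arbitrary choice made for this example.

```python
import math
import random

def normal_pdf(v, mean, var):
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def estimate_evidence(x, q_mean, q_var, num_samples, rng):
    """Average of p_theta(x, z) / q(z | x) over z ~ q(. | x)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        joint = normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0)   # p(x|z) p(z)
        total += joint / normal_pdf(z, q_mean, q_var)             # importance weight
    return total / num_samples

rng = random.Random(0)
x = 2.0
estimate = estimate_evidence(x, q_mean=1.0, q_var=1.0, num_samples=20000, rng=rng)
exact = normal_pdf(x, 0.0, 2.0)   # closed-form evidence for this model
```

The estimate is unbiased for any sampling distribution whose support covers the posterior, but its variance depends strongly on how closely that distribution matches $p_\theta(z|x)$.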
So we see that if we sample $z \sim q_\phi(\cdot|x)$, then $\frac{p_\theta(x, z)}{q_\phi(z|x)}$ is an unbiased estimator of $p_\theta(x)$. Unfortunately, this does not give us an unbiased estimator of $\ln p_\theta(x)$, because $\ln$ is nonlinear. Indeed, we have by Jensen's inequality,
$$\ln p_\theta(x) = \ln \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right] \geq \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]$$
In fact, all the obvious estimators of $\ln p_\theta(x)$ are biased downwards, because no matter how many samples of $z_i \sim q_\phi(\cdot|x)$ we take, we have by Jensen's inequality:
$$\mathbb{E}_{z_i \sim q_\phi(\cdot|x)}\left[\ln\left(\frac{1}{N}\sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i|x)}\right)\right] \leq \ln \mathbb{E}_{z_i \sim q_\phi(\cdot|x)}\left[\frac{1}{N}\sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i|x)}\right] = \ln p_\theta(x)$$
Subtracting the right side, we see that the problem comes down to a biased estimator of zero:
$$\mathbb{E}_{z_i \sim q_\phi(\cdot|x)}\left[\ln\left(\frac{1}{N}\sum_i \frac{p_\theta(z_i|x)}{q_\phi(z_i|x)}\right)\right] \leq 0$$
At this point, we could branch off towards the development of an importance-weighted autoencoder[note 2], but we will instead continue with the simplest case with $N = 1$:
$$\ln p_\theta(x) = \ln \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right] \geq \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]$$
The tightness of the inequality has a closed form:
$$\ln p_\theta(x) - \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = D_{KL}(q_\phi(\cdot|x) \| p_\theta(\cdot|x)) \geq 0$$
We have thus obtained the ELBO function:
$$L(\phi, \theta; x) := \ln p_\theta(x) - D_{KL}(q_\phi(\cdot|x) \| p_\theta(\cdot|x))$$
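The downward bias can be observed numerically. The sketch below assumes an illustrative conjugate model, $p(z) = \mathcal{N}(0, 1)$ and $p_\theta(x|z) = \mathcal{N}(z, 1)$, and deliberately uses the prior as the sampling distribution $q_\phi(\cdot|x)$ so that the gap is large. It averages the log of the mean importance weight for $N = 1$ (the ELBO case) and for $N = 50$.

```python
import math
import random

def log_normal_pdf(v, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)

def log_mean_weight(x, num_z, rng):
    """ln((1/N) sum_i p(x, z_i) / q(z_i|x)) with q(.|x) = N(0, 1), i.e. the prior."""
    weights = []
    for _ in range(num_z):
        z = rng.gauss(0.0, 1.0)
        log_joint = log_normal_pdf(x, z, 1.0) + log_normal_pdf(z, 0.0, 1.0)
        log_q = log_normal_pdf(z, 0.0, 1.0)
        weights.append(math.exp(log_joint - log_q))
    return math.log(sum(weights) / num_z)

def average_bound(x, num_z, repeats, rng):
    """Monte Carlo estimate of the expectation of log_mean_weight."""
    return sum(log_mean_weight(x, num_z, rng) for _ in range(repeats)) / repeats

rng = random.Random(0)
x = 2.0
log_evidence = log_normal_pdf(x, 0.0, 2.0)   # exact ln p_theta(x) for this model
bound_1 = average_bound(x, 1, 4000, rng)     # N = 1: the ELBO
bound_50 = average_bound(x, 50, 2000, rng)   # N = 50: a tighter bound
```

Both averages sit below `log_evidence`, with `bound_50` much closer to it than `bound_1`. For $N = 1$ the gap is the KL divergence from the sampling distribution to the true posterior.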
Maximizing the ELBO
For fixed $x$, the optimization $\max_{\theta, \phi} L(\phi, \theta; x)$ simultaneously attempts to maximize $\ln p_\theta(x)$ and minimize $D_{KL}(q_\phi(\cdot|x) \| p_\theta(\cdot|x))$. If the parametrizations of $p_\theta$ and $q_\phi$ are flexible enough, we would obtain some $\hat\phi, \hat\theta$, such that we have simultaneously
$$\ln p_{\hat\theta}(x) \approx \max_\theta \ln p_\theta(x); \quad q_{\hat\phi}(\cdot|x) \approx p_{\hat\theta}(\cdot|x)$$
Since
$$\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{KL}(p^*(x) \| p_\theta(x))$$
we have
$$\mathbb{E}_{x \sim p^*(x)}[\ln p_{\hat\theta}(x)] \approx \max_\theta \left(-H(p^*) - D_{KL}(p^*(x) \| p_\theta(x))\right)$$
and so
$$\hat\theta \approx \arg\min_\theta D_{KL}(p^*(x) \| p_\theta(x))$$
In other words, maximizing the ELBO would simultaneously allow us to obtain an accurate generative model $p_{\hat\theta} \approx p^*$ and an accurate discriminative model $q_{\hat\phi}(\cdot|x) \approx p_{\hat\theta}(\cdot|x)$.[5]
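The $\phi$ half of this optimization can be sketched with plain gradient ascent. The example assumes an illustrative conjugate model with nothing to learn in $\theta$: $p(z) = \mathcal{N}(0, 1)$, $p_\theta(x|z) = \mathcal{N}(z, 1)$, and a Gaussian family $q_\phi(z|x) = \mathcal{N}(m, s^2)$, for which the ELBO has a closed form. The step size and iteration count are arbitrary choices.

```python
import math

def elbo(x, m, log_s):
    """Closed-form ELBO for p(z) = N(0, 1), p(x|z) = N(z, 1), q(z|x) = N(m, s^2)."""
    s2 = math.exp(2 * log_s)
    expected_loglik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    kl_to_prior = 0.5 * (s2 + m ** 2 - 1.0) - log_s
    return expected_loglik - kl_to_prior

def fit_q(x, steps=500, lr=0.1):
    """Gradient ascent on the ELBO over phi = (m, log_s)."""
    m, log_s = 0.0, 0.0
    for _ in range(steps):
        s2 = math.exp(2 * log_s)
        grad_m = (x - m) - m            # d ELBO / d m
        grad_log_s = 1.0 - 2.0 * s2     # d ELBO / d log_s
        m += lr * grad_m
        log_s += lr * grad_log_s
    return m, log_s

x = 2.0
m_hat, log_s_hat = fit_q(x)
log_evidence = -0.5 * math.log(4 * math.pi) - x ** 2 / 4   # ln N(x; 0, 2)
```

The fitted parameters converge to the exact posterior $\mathcal{N}(x/2, 1/2)$, at which point the ELBO equals `log_evidence`. With a less flexible family for $q_\phi$, a positive KL gap would remain at the optimum.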
Main forms
The ELBO has many possible expressions, each with some different emphasis.
$$\mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \int q_\phi(z|x) \ln \frac{p_\theta(x, z)}{q_\phi(z|x)} dz$$
This form shows that if we sample $z \sim q_\phi(\cdot|x)$, then $\ln \frac{p_\theta(x, z)}{q_\phi(z|x)}$ is an unbiased estimator of the ELBO.
$$\ln p_\theta(x) - D_{KL}(q_\phi(\cdot|x) \| p_\theta(\cdot|x))$$
This form shows that the ELBO is a lower bound on the evidence $\ln p_\theta(x)$, and that maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL-divergence from $p_\theta(\cdot|x)$ to $q_\phi(\cdot|x)$.
$$\mathbb{E}_{z \sim q_\phi(\cdot|x)}[\ln p_\theta(x|z)] - D_{KL}(q_\phi(\cdot|x) \| p)$$
This form shows that maximizing the ELBO simultaneously attempts to keep $q_\phi(\cdot|x)$ close to $p$ and concentrate $q_\phi(\cdot|x)$ on those $z$ that maximize $\ln p_\theta(x|z)$. That is, the approximate posterior $q_\phi(\cdot|x)$ balances between staying close to the prior $p$ and moving towards the maximum likelihood $\arg\max_z \ln p_\theta(x|z)$.
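The trade-off expressed by the third form can be illustrated on a small discrete model. The numbers in `prior` and `likelihood` are hypothetical, and the three candidate distributions are chosen to show the two extremes and the balance between them.

```python
import math

prior = [0.5, 0.3, 0.2]         # p(z) over three latent values
likelihood = [0.1, 0.6, 0.9]    # p_theta(x | z), evaluated at the observed x

def kl(a, b):
    """D_KL(a || b), skipping terms where a is zero."""
    return sum(az * math.log(az / bz) for az, bz in zip(a, b) if az > 0)

def elbo_third_form(q):
    """E_{z ~ q}[ln p_theta(x | z)] - D_KL(q || prior)."""
    reconstruction = sum(qz * math.log(lz) for qz, lz in zip(q, likelihood))
    return reconstruction - kl(q, prior)

evidence = sum(pz * lz for pz, lz in zip(prior, likelihood))
posterior = [pz * lz / evidence for pz, lz in zip(prior, likelihood)]

candidates = {
    "prior": prior,                  # zero KL penalty, poor expected log-likelihood
    "argmax_z": [0.0, 0.0, 1.0],     # best log-likelihood, large KL penalty
    "posterior": posterior,          # the balance that attains ln p_theta(x)
}
scores = {name: elbo_third_form(q) for name, q in candidates.items()}
```

Neither extreme maximizes the ELBO: staying at the prior ignores the data, while collapsing onto the most likely $z$ pays a large KL penalty. The exact posterior scores `math.log(evidence)`, the largest value any $q$ can reach.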
Data-processing inequality
Suppose we take $N$ independent samples from $p^*$, and collect them in the dataset $D = \{x_1, ..., x_N\}$, then we have empirical distribution $q_D(x) = \frac{1}{N}\sum_i \delta_{x_i}$.
Fitting $p_\theta(x)$ to $q_D(x)$ can be done, as usual, by maximizing the loglikelihood $\ln p_\theta(D)$:
$$D_{KL}(q_D(x) \| p_\theta(x)) = -\frac{1}{N}\sum_i \ln p_\theta(x_i) - H(q_D) = -\frac{1}{N}\ln p_\theta(D) - H(q_D)$$
Now, by the ELBO inequality, we can bound $\ln p_\theta(D)$, and thus
$$D_{KL}(q_D(x) \| p_\theta(x)) \leq -\frac{1}{N} L(\phi, \theta; D) - H(q_D)$$
The right-hand-side simplifies to a KL-divergence, and so we get:
$$D_{KL}(q_D(x) \| p_\theta(x)) \leq -\frac{1}{N}\sum_i L(\phi, \theta; x_i) - H(q_D) = D_{KL}(q_{D,\phi}(x, z) \| p_\theta(x, z))$$
where $q_{D,\phi}(x, z) = q_D(x) q_\phi(z|x)$. This result can be interpreted as a special case of the data processing inequality.
In this interpretation, maximizing $L(\phi, \theta; D) = \sum_i L(\phi, \theta; x_i)$ is minimizing $D_{KL}(q_{D,\phi}(x, z) \| p_\theta(x, z))$, which upper-bounds the real quantity of interest $D_{KL}(q_D(x) \| p_\theta(x))$ via the data-processing inequality. That is, we append a latent space to the observable space, paying the price of a weaker inequality for the sake of more computationally efficient minimization of the KL-divergence.[6]
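The inequality can be verified directly when both $x$ and $z$ are discrete, so that the entropy of the empirical distribution is well defined. The sketch below uses a hypothetical binary model and a made-up dataset, and takes the joint distribution on the data side to be $q_D(x) q_\phi(z|x)$.

```python
import math

# Hypothetical discrete model: x in {0, 1}, z in {0, 1}.
prior = [0.6, 0.4]                    # p(z)
lik = [[0.8, 0.2], [0.3, 0.7]]        # lik[z][x] = p_theta(x | z)
q = [[0.7, 0.3], [0.2, 0.8]]          # q[x][z] = q_phi(z | x)
dataset = [0, 0, 1, 0, 1, 1, 0, 0]    # observed samples x_1, ..., x_N

N = len(dataset)
q_D = [dataset.count(x) / N for x in (0, 1)]                        # empirical distribution
p_x = [sum(prior[z] * lik[z][x] for z in (0, 1)) for x in (0, 1)]   # marginal p_theta(x)

# KL divergence on the observable space alone.
kl_marginal = sum(q_D[x] * math.log(q_D[x] / p_x[x]) for x in (0, 1))

# KL divergence on the joint space, with q_D(x) q_phi(z|x) on the data side.
kl_joint = sum(
    q_D[x] * q[x][z] * math.log(q_D[x] * q[x][z] / (prior[z] * lik[z][x]))
    for x in (0, 1) for z in (0, 1)
)

def elbo(x):
    """L(phi, theta; x) for a single observation."""
    return sum(q[x][z] * math.log(prior[z] * lik[z][x] / q[x][z]) for z in (0, 1))

entropy_q_D = -sum(p * math.log(p) for p in q_D)
bound_from_elbo = -sum(elbo(x) for x in dataset) / N - entropy_q_D
```

Here `bound_from_elbo` coincides with `kl_joint`, and both are at least as large as `kl_marginal`. The two KL divergences would be equal only if $q_\phi(z|x)$ matched the exact posterior for every observed $x$.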
References
- ^ Kingma, Diederik P.; Welling, Max (2014-05-01). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
- ^ Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "Chapter 19". Deep Learning. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press. ISBN 978-0-262-03561-3.
- ^ Hinton, Geoffrey E; Zemel, Richard (1993). "Autoencoders, Minimum Description Length and Helmholtz Free Energy". Advances in Neural Information Processing Systems. 6. Morgan-Kaufmann.
- ^ Burda, Yuri; Grosse, Roger; Salakhutdinov, Ruslan (2015-09-01). "Importance Weighted Autoencoders". arXiv:1509.00519 [stat.ML].
- ^ Neal, Radford M.; Hinton, Geoffrey E. (1998), "A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants", Learning in Graphical Models, Dordrecht: Springer Netherlands, pp. 355–368, doi:10.1007/978-94-011-5014-9_12, ISBN 978-94-010-6104-9, S2CID 17947141
- ^ Kingma, Diederik P.; Welling, Max (2019-11-27). "An Introduction to Variational Autoencoders". Foundations and Trends in Machine Learning. 12 (4). Section 2.7. arXiv:1906.02691. doi:10.1561/2200000056. ISSN 1935-8237. S2CID 174802445.
Notes
- ^ In fact, by Jensen's inequality, $$\mathbb{E}_{x_i \sim p^*(x)}\left[\max_\theta \sum_i \ln p_\theta(x_i)\right] \geq \max_\theta \mathbb{E}_{x_i \sim p^*(x)}\left[\sum_i \ln p_\theta(x_i)\right] = N \max_\theta \mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)]$$ The estimator is biased upwards. This can be seen as overfitting: for some finite set of sampled data $x_i$, there is usually some $\theta$ that fits them better than the entire $p^*$ distribution.
- ^ By the delta method, we have $$\mathbb{E}_{z_i \sim q_\phi(\cdot|x)}\left[\ln\left(\frac{1}{N}\sum_i \frac{p_\theta(z_i|x)}{q_\phi(z_i|x)}\right)\right] \approx -\frac{1}{2N}\mathbb{V}_{z \sim q_\phi(\cdot|x)}\left[\frac{p_\theta(z|x)}{q_\phi(z|x)}\right] = O(N^{-1})$$ If we continue with this, we would obtain the importance-weighted autoencoder.[4]