In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound[1] or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data.
The ELBO is useful because it provides a guarantee on the worst-case log-likelihood of some distribution (e.g. $p(X)$) which models a set of data. The actual log-likelihood may be higher (indicating an even better fit to the distribution) because the ELBO includes a Kullback–Leibler divergence (KL divergence) term that lowers the bound whenever an internal part of the model is inaccurate, even if the model fits well overall. Thus improving the ELBO score indicates improving either the likelihood of the model $p(X)$ or the fit of a component internal to the model, or both, and the ELBO score makes a good loss function, e.g., for training a deep neural network to improve both the model overall and the internal component. (The internal component is $q_\phi(\cdot\mid x)$, defined in detail later in this article.)
Definition

Let $X$ and $Z$ be random variables, jointly distributed with distribution $p_\theta$. For example, $p_\theta(X)$ is the marginal distribution of $X$, and $p_\theta(Z \mid X)$ is the conditional distribution of $Z$ given $X$. Then, for a sample $x \sim p_\theta$ and any distribution $q_\phi$, the ELBO is defined as
$$L(\phi, \theta; x) := \mathbb{E}_{z \sim q_\phi(\cdot\mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right].$$
The ELBO can equivalently be written as[2]
$$\begin{aligned} L(\phi, \theta; x) &= \mathbb{E}_{z \sim q_\phi(\cdot\mid x)}\left[\ln p_\theta(x, z)\right] + H\big[q_\phi(z\mid x)\big] \\ &= \ln p_\theta(x) - D_{KL}\big(q_\phi(z\mid x) \,\|\, p_\theta(z\mid x)\big). \end{aligned}$$
In the first line, $H[q_\phi(z\mid x)]$ is the entropy of $q_\phi$, which relates the ELBO to the Helmholtz free energy.[3] In the second line, $\ln p_\theta(x)$ is called the evidence for $x$, and $D_{KL}(q_\phi(z\mid x) \,\|\, p_\theta(z\mid x))$ is the Kullback–Leibler divergence between $q_\phi$ and $p_\theta$. Since the Kullback–Leibler divergence is non-negative, $L(\phi, \theta; x)$ forms a lower bound on the evidence (ELBO inequality):
$$\ln p_\theta(x) \ge \mathbb{E}_{z \sim q_\phi(\cdot\mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right].$$
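As a quick sanity check of the ELBO inequality, the bound can be verified numerically on a toy discrete model; all probabilities below are made up purely for illustration:

```python
import math

# Toy model (hypothetical numbers): binary latent z, binary observable x.
p_z = {0: 0.5, 1: 0.5}                      # prior p(z)
p_x_given_z = {0: {0: 0.9, 1: 0.1},         # p(x | z), indexed as [z][x]
               1: {0: 0.2, 1: 0.8}}

def log_evidence(x):
    # ln p(x) = ln sum_z p(x | z) p(z)
    return math.log(sum(p_x_given_z[z][x] * p_z[z] for z in p_z))

def elbo(x, q):
    # L = E_{z ~ q(.|x)} [ ln p(x, z) - ln q(z | x) ]
    return sum(q[z] * (math.log(p_x_given_z[z][x] * p_z[z]) - math.log(q[z]))
               for z in q)

x = 1
q = {0: 0.3, 1: 0.7}        # an arbitrary variational distribution q(z | x)
gap = log_evidence(x) - elbo(x, q)
assert gap >= 0.0           # the ELBO never exceeds the log-evidence
```

Any choice of $q$ gives a valid lower bound; the gap shrinks as $q$ approaches the true posterior.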
Motivation

Variational Bayesian inference

Suppose we have an observable random variable $X$, and we want to find its true distribution $p$. This would allow us to generate data by sampling and estimate probabilities of future events. In general, it is impossible to find $p$ exactly, forcing us to search for a good approximation.
That is, we define a sufficiently large parametric family $\{p_\theta\}_{\theta \in \Theta}$ of distributions, then solve $\min_\theta L(p_\theta, p)$ for some loss function $L$. One possible way to solve this is by considering a small variation from $p_\theta$ to $p_{\theta + \delta\theta}$ and solving $L(p_\theta, p) - L(p_{\theta + \delta\theta}, p) = 0$. This is a problem in the calculus of variations, which is why it is called the variational method.
Since there are not many explicitly parametrized distribution families (all the classical distribution families, such as the normal distribution, the Gumbel distribution, etc., are far too simplistic to model the true distribution), we consider implicitly parametrized probability distributions:
First, define a simple distribution $p(z)$ over a latent random variable $Z$. Usually a normal distribution or a uniform distribution suffices.

Next, define a family of complicated functions $f_\theta$ (such as a deep neural network) parametrized by $\theta$.

Finally, define a way to convert any $f_\theta(z)$ into a simple distribution over the observable random variable $X$. For example, let $f_\theta(z) = (f_1(z), f_2(z))$ have two outputs; then we can define the corresponding distribution over $X$ to be the normal distribution $\mathcal{N}(f_1(z), e^{f_2(z)})$.
This defines a family of joint distributions $p_\theta$ over $(X, Z)$. It is very easy to sample $(x, z) \sim p_\theta$: simply sample $z \sim p$, then compute $f_\theta(z)$, and finally sample $x \sim p_\theta(\cdot\mid z)$ using $f_\theta(z)$.
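This ancestral-sampling recipe can be sketched in a few lines. The decoder below is a made-up stand-in for a trained network, and for simplicity its second output is treated as a log standard deviation rather than a log variance:

```python
import math
import random

random.seed(0)

def f_theta(z):
    # Hypothetical stand-in for a neural network: two outputs (f1, f2).
    return 2.0 * z + 1.0, 0.5 * z      # (mean of x, log-sd of x)

def sample_joint():
    z = random.gauss(0.0, 1.0)                 # z ~ p(z), standard normal prior
    mean, log_sd = f_theta(z)                  # push z through the decoder
    x = random.gauss(mean, math.exp(log_sd))   # x ~ N(f1(z), e^{f2(z)})
    return x, z

samples = [sample_joint() for _ in range(1000)]
```

Sampling the joint never requires evaluating the intractable marginal $p_\theta(x)$, which is what makes such generative models convenient.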
In other words, we have a generative model for both the observable and the latent. Now, we consider a distribution $p_\theta$ good if it is a close approximation of $p$:
$$p_\theta(X) \approx p(X);$$
since the distribution on the right side is over $X$ only, the distribution on the left side must marginalize the latent variable $Z$ away. In general, it is impossible to perform the integral $p_\theta(x) = \int p_\theta(x\mid z) p(z)\,dz$, forcing us to perform another approximation.
Since $p_\theta(x) = \frac{p_\theta(x\mid z) p(z)}{p_\theta(z\mid x)}$ (Bayes' rule), it suffices to find a good approximation of $p_\theta(z\mid x)$. So define another distribution family $q_\phi(z\mid x)$ and use it to approximate $p_\theta(z\mid x)$. This is a discriminative model for the latent.
The entire situation is summarized in the following table:

| $X$: observable | $(X, Z)$ | $Z$: latent |
| --- | --- | --- |
| $p(x) \approx p_\theta(x) \approx \frac{p_\theta(x\mid z) p(z)}{q_\phi(z\mid x)}$: approximable | $p_\theta(x, z) = p_\theta(x\mid z) p(z)$: easy | $p(z)$: easy |
| | | $p_\theta(z\mid x) \approx q_\phi(z\mid x)$: approximable |
| $p_\theta(x\mid z)$: easy | | |
In Bayesian language, $X$ is the observed evidence and $Z$ is the latent/unobserved. The distribution $p$ over $Z$ is the prior distribution over $Z$, $p_\theta(x\mid z)$ is the likelihood function, and $p_\theta(z\mid x)$ is the posterior distribution over $Z$.
Given an observation $x$, we can infer what $z$ likely gave rise to $x$ by computing $p_\theta(z\mid x)$. The usual Bayesian method is to estimate the integral $p_\theta(x) = \int p_\theta(x\mid z) p(z)\,dz$, then compute $p_\theta(z\mid x) = \frac{p_\theta(x\mid z) p(z)}{p_\theta(x)}$ by Bayes' rule. This is expensive to perform in general, but if we can simply find a good approximation $q_\phi(z\mid x) \approx p_\theta(z\mid x)$ for most $x, z$, then we can infer $z$ from $x$ cheaply. Thus, the search for a good $q_\phi$ is also called amortized inference.
All in all, we have arrived at a problem of variational Bayesian inference.
Deriving the ELBO
A basic result in variational inference is that minimizing the Kullback–Leibler divergence (KL divergence) is equivalent to maximizing the log-likelihood:
$$\mathbb{E}_{x \sim p(x)}[\ln p_\theta(x)] = -H(p) - D_{KL}\big(p(x) \,\|\, p_\theta(x)\big),$$
where $H(p) = -\mathbb{E}_{x \sim p}[\ln p(x)]$ is the entropy of the true distribution. So if we can maximize $\mathbb{E}_{x \sim p(x)}[\ln p_\theta(x)]$, we can minimize $D_{KL}(p(x) \,\|\, p_\theta(x))$, and consequently find an accurate approximation $p_\theta \approx p$.

To maximize $\mathbb{E}_{x \sim p(x)}[\ln p_\theta(x)]$, we simply sample many $x_i \sim p(x)$, i.e. use importance sampling:
$$N \max_\theta \mathbb{E}_{x \sim p(x)}[\ln p_\theta(x)] \approx \max_\theta \sum_i \ln p_\theta(x_i),$$
where $N$ is the number of samples drawn from the true distribution. This approximation can be seen as overfitting.[note 1]

In order to maximize $\sum_i \ln p_\theta(x_i)$, it is necessary to find $\ln p_\theta(x)$:
$$\ln p_\theta(x) = \ln \int p_\theta(x\mid z) p(z)\,dz.$$
This usually has no closed form and must be estimated. The usual way to estimate integrals is Monte Carlo integration with importance sampling:
$$\int p_\theta(x\mid z) p(z)\,dz = \mathbb{E}_{z \sim q_\phi(\cdot\mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right],$$
where $q_\phi(z\mid x)$ is a sampling distribution over $z$ that we use to perform the Monte Carlo integration.
So we see that if we sample $z \sim q_\phi(\cdot\mid x)$, then $\frac{p_\theta(x, z)}{q_\phi(z\mid x)}$ is an unbiased estimator of $p_\theta(x)$. Unfortunately, this does not give us an unbiased estimator of $\ln p_\theta(x)$, because $\ln$ is nonlinear. Indeed, we have by Jensen's inequality
$$\ln p_\theta(x) = \ln \mathbb{E}_{z \sim q_\phi(\cdot\mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right] \ge \mathbb{E}_{z \sim q_\phi(\cdot\mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right].$$
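This downward bias is easy to observe numerically. In the sketch below (a toy discrete model with made-up probabilities, where $p(x{=}1) = 0.45$ exactly), the importance-sampling ratio averages to $p_\theta(x)$, while its logarithm averages to something strictly below $\ln p_\theta(x)$:

```python
import math
import random

random.seed(0)

# Toy model (hypothetical numbers): p(z) uniform on {0, 1},
# p(x=1 | z=0) = 0.1, p(x=1 | z=1) = 0.8, so p(x=1) = 0.45.
p_joint = {0: 0.05, 1: 0.40}        # p(x=1, z) for z = 0, 1
q = {0: 0.3, 1: 0.7}                # sampling distribution q(z | x=1)

def sample_ratio():
    z = 0 if random.random() < q[0] else 1
    return p_joint[z] / q[z]        # unbiased estimate of p(x=1)

n = 200_000
ratios = [sample_ratio() for _ in range(n)]
mean_ratio = sum(ratios) / n                      # ~ 0.45: unbiased
mean_log = sum(math.log(r) for r in ratios) / n   # < ln 0.45: Jensen bias
```

Increasing the number of samples concentrates `mean_ratio` around $p_\theta(x)$ but does not remove the bias of `mean_log`.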
In fact, all the obvious estimators of $\ln p_\theta(x)$ are biased downwards, because no matter how many samples $z_i \sim q_\phi(\cdot\mid x)$ we take, we have by Jensen's inequality:
$$\mathbb{E}_{z_i \sim q_\phi(\cdot\mid x)}\left[\ln\left(\frac{1}{N} \sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i\mid x)}\right)\right] \le \ln \mathbb{E}_{z_i \sim q_\phi(\cdot\mid x)}\left[\frac{1}{N} \sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i\mid x)}\right] = \ln p_\theta(x).$$
Subtracting the right side, we see that the problem comes down to a biased estimator of zero:
$$\mathbb{E}_{z_i \sim q_\phi(\cdot\mid x)}\left[\ln\left(\frac{1}{N} \sum_i \frac{p_\theta(z_i\mid x)}{q_\phi(z_i\mid x)}\right)\right] \le 0.$$
At this point, we could branch off towards the development of an importance-weighted autoencoder,[note 2] but we will instead continue with the simplest case, $N = 1$:
$$\ln p_\theta(x) = \ln \mathbb{E}_{z \sim q_\phi(\cdot\mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right] \ge \mathbb{E}_{z \sim q_\phi(\cdot\mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right].$$
The tightness of the inequality has a closed form:
$$\ln p_\theta(x) - \mathbb{E}_{z \sim q_\phi(\cdot\mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right] = D_{KL}\big(q_\phi(\cdot\mid x) \,\|\, p_\theta(\cdot\mid x)\big) \ge 0.$$
We have thus obtained the ELBO function:
$$L(\phi, \theta; x) := \ln p_\theta(x) - D_{KL}\big(q_\phi(\cdot\mid x) \,\|\, p_\theta(\cdot\mid x)\big).$$
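The gap identity can be checked directly on a toy discrete model (made-up numbers): the difference between the log-evidence and the ELBO comes out exactly equal to the KL divergence from the variational distribution to the true posterior:

```python
import math

p_z = {0: 0.5, 1: 0.5}              # prior p(z) (hypothetical numbers)
p_x1_given_z = {0: 0.1, 1: 0.8}     # p(x=1 | z)
q = {0: 0.3, 1: 0.7}                # variational q(z | x=1)

log_ev = math.log(sum(p_x1_given_z[z] * p_z[z] for z in p_z))   # ln p(x=1)
elbo = sum(q[z] * math.log(p_x1_given_z[z] * p_z[z] / q[z]) for z in q)
posterior = {z: p_x1_given_z[z] * p_z[z] / math.exp(log_ev) for z in p_z}
kl = sum(q[z] * math.log(q[z] / posterior[z]) for z in q)

assert abs((log_ev - elbo) - kl) < 1e-12    # gap == KL(q || posterior)
```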
Maximizing the ELBO
For fixed $x$, the optimization $\max_{\theta, \phi} L(\phi, \theta; x)$ simultaneously attempts to maximize $\ln p_\theta(x)$ and minimize $D_{KL}\big(q_\phi(\cdot\mid x) \,\|\, p_\theta(\cdot\mid x)\big)$. If the parametrizations for $p_\theta$ and $q_\phi$ are flexible enough, we would obtain some $\hat\phi, \hat\theta$, such that we have simultaneously
$$\ln p_{\hat\theta}(x) \approx \max_\theta \ln p_\theta(x); \quad q_{\hat\phi}(\cdot\mid x) \approx p_{\hat\theta}(\cdot\mid x).$$
Since
$$\mathbb{E}_{x \sim p(x)}[\ln p_\theta(x)] = -H(p) - D_{KL}\big(p(x) \,\|\, p_\theta(x)\big),$$
we have
$$\ln p_{\hat\theta}(x) \approx \max_\theta \left(-H(p) - D_{KL}\big(p(x) \,\|\, p_\theta(x)\big)\right),$$
and so
$$\hat\theta \approx \arg\min_\theta D_{KL}\big(p(x) \,\|\, p_\theta(x)\big).$$
In other words, maximizing the ELBO would simultaneously allow us to obtain an accurate generative model $p_{\hat\theta} \approx p$ and an accurate discriminative model $q_{\hat\phi}(\cdot\mid x) \approx p_{\hat\theta}(\cdot\mid x)$.[5]
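For a fully Gaussian toy problem this can be checked end to end. Assume (all choices below are illustrative) a prior $z \sim \mathcal N(0,1)$, likelihood $x \mid z \sim \mathcal N(z, 1)$, and a Gaussian variational family $q = \mathcal N(m, s^2)$. The ELBO then has a closed form, and a crude grid search over $(m, s)$ recovers the exact posterior $\mathcal N(0.5,\, 0.5)$ for the observation $x = 1$, with a nearly tight bound:

```python
import math

x = 1.0  # single observation; prior z ~ N(0,1), likelihood x|z ~ N(z,1)

def elbo(m, s):
    # E_q[ln p(x,z)] + H(q) for q = N(m, s^2), all in closed form
    exp_log_joint = (-(m * m + s * s) / 2
                     - ((x - m) ** 2 + s * s) / 2
                     - math.log(2 * math.pi))
    entropy = 0.5 * math.log(2 * math.pi * math.e * s * s)
    return exp_log_joint + entropy

# Crude grid search over the variational parameters (m, s).
best = max(((elbo(m / 100, s / 100), m / 100, s / 100)
            for m in range(-200, 201) for s in range(1, 201)),
           key=lambda t: t[0])
L, m_hat, s_hat = best

log_evidence = math.log(1 / math.sqrt(2 * math.pi * 2)) - x * x / 4  # ln N(1; 0, 2)
```

In practice the grid search would be replaced by stochastic gradient ascent on a reparametrized Monte Carlo estimate of the ELBO, but the fixed point is the same: $q$ matches the posterior and the bound touches the evidence.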
Main forms
The ELBO has many equivalent expressions, each with a different emphasis.
$$\mathbb{E}_{z \sim q_\phi(\cdot\mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z\mid x)}\right] = \int q_\phi(z\mid x) \ln \frac{p_\theta(x, z)}{q_\phi(z\mid x)}\,dz.$$
This form shows that if we sample $z \sim q_\phi(\cdot\mid x)$, then $\ln \frac{p_\theta(x, z)}{q_\phi(z\mid x)}$ is an unbiased estimator of the ELBO.
$$\ln p_\theta(x) - D_{KL}\big(q_\phi(\cdot\mid x) \,\|\, p_\theta(\cdot\mid x)\big).$$
This form shows that the ELBO is a lower bound on the evidence $\ln p_\theta(x)$, and that maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence from $p_\theta(\cdot\mid x)$ to $q_\phi(\cdot\mid x)$.
$$\mathbb{E}_{z \sim q_\phi(\cdot\mid x)}\left[\ln p_\theta(x\mid z)\right] - D_{KL}\big(q_\phi(\cdot\mid x) \,\|\, p\big).$$
This form shows that maximizing the ELBO simultaneously attempts to keep $q_\phi(\cdot\mid x)$ close to $p$ and to concentrate $q_\phi(\cdot\mid x)$ on the $z$ that maximize $\ln p_\theta(x\mid z)$. That is, the approximate posterior $q_\phi(\cdot\mid x)$ balances between staying close to the prior $p$ and moving towards the maximum likelihood $\arg\max_z \ln p_\theta(x\mid z)$.
$$H\big[q_\phi(\cdot\mid x)\big] + \mathbb{E}_{z \sim q(\cdot\mid x)}\left[\ln p_\theta(z\mid x)\right] + \ln p_\theta(x).$$
This form shows that maximizing the ELBO simultaneously attempts to keep the entropy of $q_\phi(\cdot\mid x)$ high and to concentrate $q_\phi(\cdot\mid x)$ on the $z$ that maximize $\ln p_\theta(z\mid x)$. That is, the approximate posterior $q_\phi(\cdot\mid x)$ balances between being a uniform distribution and moving towards the maximum a posteriori $\arg\max_z \ln p_\theta(z\mid x)$.
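As a consistency check, the different forms must agree numerically. The sketch below (same kind of toy discrete model with made-up probabilities) compares the joint-ratio form with the "reconstruction minus KL to the prior" form:

```python
import math

# Toy discrete model: p(z) uniform on {0, 1}, p(x=1 | z) as given below.
p_z = {0: 0.5, 1: 0.5}
p_x1_given_z = {0: 0.1, 1: 0.8}
q = {0: 0.3, 1: 0.7}                    # variational q(z | x=1)

# Form 1: E_q[ln p(x, z) - ln q(z | x)]
form1 = sum(q[z] * (math.log(p_x1_given_z[z] * p_z[z]) - math.log(q[z]))
            for z in q)

# Form 3: E_q[ln p(x | z)] - KL(q || p(z))
form3 = (sum(q[z] * math.log(p_x1_given_z[z]) for z in q)
         - sum(q[z] * math.log(q[z] / p_z[z]) for z in q))

assert abs(form1 - form3) < 1e-12       # the two expressions coincide
```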
Data-processing inequality
Suppose we take $N$ independent samples from $p$ and collect them in the dataset $D = \{x_1, \dots, x_N\}$; then we have the empirical distribution $q_D(x) = \frac{1}{N} \sum_i \delta_{x_i}$.
Fitting $p_\theta(x)$ to $q_D(x)$ can be done, as usual, by maximizing the log-likelihood $\ln p_\theta(D)$:
$$D_{KL}\big(q_D(x) \,\|\, p_\theta(x)\big) = -\frac{1}{N} \sum_i \ln p_\theta(x_i) - H(q_D) = -\frac{1}{N} \ln p_\theta(D) - H(q_D).$$
Now, by the ELBO inequality, we can bound $\ln p_\theta(D)$, and thus
$$D_{KL}\big(q_D(x) \,\|\, p_\theta(x)\big) \le -\frac{1}{N} L(\phi, \theta; D) - H(q_D).$$
The right-hand side simplifies to a KL divergence, and so we get:
$$D_{KL}\big(q_D(x) \,\|\, p_\theta(x)\big) \le -\frac{1}{N} \sum_i L(\phi, \theta; x_i) - H(q_D) = D_{KL}\big(q_{D, \phi}(x, z) \,\|\, p_\theta(x, z)\big).$$
This result can be interpreted as a special case of the data-processing inequality.
In this interpretation, maximizing $L(\phi, \theta; D) = \sum_i L(\phi, \theta; x_i)$ is minimizing $D_{KL}\big(q_{D, \phi}(x, z) \,\|\, p_\theta(x, z)\big)$, which upper-bounds the real quantity of interest $D_{KL}\big(q_D(x) \,\|\, p_\theta(x)\big)$ via the data-processing inequality. That is, we append a latent space to the observable space, paying the price of a weaker inequality for the sake of more computationally efficient minimization of the KL divergence.[6]
References

1. Kingma, Diederik P.; Welling, Max (2014). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
2. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "Chapter 19". Deep Learning. Cambridge, MA: MIT Press. ISBN 978-0-262-03561-3.
3. Hinton, Geoffrey E.; Zemel, Richard (1993). "Autoencoders, Minimum Description Length and Helmholtz Free Energy". Advances in Neural Information Processing Systems. 6. Morgan Kaufmann.
4. Burda, Yuri; Grosse, Roger; Salakhutdinov, Ruslan (2015). "Importance Weighted Autoencoders". arXiv:1509.00519 [stat.ML].
5. Neal, Radford M.; Hinton, Geoffrey E. (1998). "A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants". Learning in Graphical Models. Dordrecht: Springer Netherlands. pp. 355–368. doi:10.1007/978-94-011-5014-9_12. ISBN 978-94-010-6104-9.
6. Kingma, Diederik P.; Welling, Max (2019). "An Introduction to Variational Autoencoders". Foundations and Trends in Machine Learning. 12 (4). Section 2.7. arXiv:1906.02691. doi:10.1561/2200000056. ISSN 1935-8237.
Notes
1. In fact, by Jensen's inequality,
$$\mathbb{E}_{x_i \sim p(x)}\left[\max_\theta \sum_i \ln p_\theta(x_i)\right] \ge \max_\theta \mathbb{E}_{x_i \sim p(x)}\left[\sum_i \ln p_\theta(x_i)\right] = N \max_\theta \mathbb{E}_{x \sim p(x)}[\ln p_\theta(x)],$$
so the estimator is biased upwards. This can be seen as overfitting: for some finite set of sampled data $x_i$, there is usually some $\theta$ that fits them better than the entire distribution $p$.
2. By the delta method, we have
$$\mathbb{E}_{z_i \sim q_\phi(\cdot\mid x)}\left[\ln\left(\frac{1}{N} \sum_i \frac{p_\theta(z_i\mid x)}{q_\phi(z_i\mid x)}\right)\right] \approx -\frac{1}{2N} \mathbb{V}_{z \sim q_\phi(\cdot\mid x)}\left[\frac{p_\theta(z\mid x)}{q_\phi(z\mid x)}\right] + o(N^{-1}).$$
If we continue with this, we would obtain the importance-weighted autoencoder.[4]