**ELBO Derivation for VAE (variational autoencoder)**

*Derivation:* https://www.youtube.com/watch?v=IXsA5Rpp25w&ab_channel=KapilSachdeva

*Intuition:* https://www.youtube.com/watch?v=HxQ94L8n0vU&ab_channel=MachineLearning%26Simulation

(PS just watch the first video and read along for VAE)

**Introduction**

Variational autoencoders have been used for anomaly detection, data compression, image denoising, and for reducing dimensionality in preparation for some other algorithm or model. These applications vary in their use of a trained VAE’s encoder and decoder: some use both, while others use only one.

The key point of similarity between a VAE and an autoencoder is that both use neural networks for tasks that can be interpreted as compression and reconstruction. Additionally, one term in the ELBO (evidence lower bound) resembles the reconstruction error of an autoencoder. Apart from these similarities, VAEs are quite different from autoencoders. Crucially, a VAE is an unsupervised generative model, whereas an autoencoder is not generative; an autoencoder is sometimes described as ‘self-supervised’. A VAE, on the other hand, models the variability in the observations and can be used to synthesize new observations.

# Latent variables and the latent variable model

A latent variable is a random variable whose value is never observed, so it cannot be conditioned on during inference. ‘Latent’ means hidden. Latent variables do not need to correspond to real quantities. Sometimes models that outwardly do not involve latent quantities are more conveniently expressed by imagining that they do. A standard example is the Gaussian mixture model: an observation can be generated by first sampling a label from a categorical distribution, then drawing from the Gaussian component of the mixture that has that label.
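The two-step sampling procedure for a Gaussian mixture can be sketched in a few lines of Python. This is only an illustration: the function name `sample_gaussian_mixture`, the one-dimensional setting, and the example weights, means, and standard deviations are all assumptions chosen for the demo, not part of the notes above.

```python
import random


def sample_gaussian_mixture(n, weights, means, stds, seed=0):
    """Draw n observations from a 1-D Gaussian mixture via its latent label.

    Hypothetical helper: weights are the categorical probabilities of the
    latent label, and means/stds describe the Gaussian for each label.
    """
    rng = random.Random(seed)
    labels, observations = [], []
    for _ in range(n):
        # Step 1: sample the latent label z from a categorical distribution.
        z = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        # Step 2: sample x from the Gaussian component selected by z.
        x = rng.gauss(means[z], stds[z])
        labels.append(z)
        observations.append(x)
    return labels, observations


# Example: two components; only xs would be visible in a real data set.
labels, xs = sample_gaussian_mixture(
    5, weights=[0.3, 0.7], means=[-2.0, 3.0], stds=[0.5, 1.0]
)
print(labels)
print(xs)
```

In practice only `xs` is observed; the `labels` list plays the role of the latent variable and would have to be inferred.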

A latent variable model underlies the variational autoencoder: a latent random variable $Z$ is assumed to have distribution $p_{\theta^*}(z)$, and the observation $X$ is assumed to follow the conditional distribution $p_{\theta^*}(x \mid z)$, where $\theta^*$ denotes the true (unknown) parameter value. $X$ may be either continuous or discrete.

Fitting the model means choosing $\theta$ to maximize the marginal likelihood $p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz$. If this likelihood or its gradient can be efficiently evaluated or approximated, then maximizing it with respect to $\theta$ is straightforward. Alternatively, the marginal likelihood may be intractable while the posterior $p_\theta(z \mid x)$ is known or can be efficiently approximated, in which case the EM algorithm could be used.

A simple approach to estimating $p_\theta(x)$ is to take samples $z_i$ ($i \in I$) from the prior $p_\theta(z)$, then take the average of their $p_\theta(x \mid z_i)$ values. The problem with this method is that if $z$ is high-dimensional, then a very large sample is required to estimate $p_\theta(x)$ well.

Variational inference provides an alternative approach to fitting the model. The high-level idea is this: approximate $p_\theta(z \mid x)$, then use this approximation to estimate a lower bound on $\log p_\theta(x)$. $\theta$ can then be updated based on this lower bound. The first step in this variational approach is to introduce an approximating distribution for $p_\theta(z \mid x)$. Call this approximating distribution $q_\phi(z \mid x)$, where $\phi$ is its parameter. $q_\phi$ is fit to $p_\theta$ by minimizing the Kullback-Leibler divergence

$$
D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(z \mid x)\big].
$$
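As an aside on the simple sampling estimate of the marginal likelihood mentioned above, the following sketch averages the likelihood over samples drawn from the prior. It assumes a toy model chosen purely because its marginal is known in closed form: a standard normal prior on z and a Gaussian likelihood centred at z with variance `noise_var`. The function names and the model itself are hypothetical and are not the VAE's decoder.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I) evaluated at x."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2 * math.pi * var) + sq / var)


def mc_log_marginal(x, n_samples, noise_var=0.25, seed=0):
    """Estimate log p(x) by averaging p(x | z_i) over prior samples z_i ~ N(0, I)."""
    rng = random.Random(seed)
    d = len(x)
    log_terms = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]       # z_i drawn from the prior
        log_terms.append(log_normal_pdf(x, z, noise_var))  # log p(x | z_i)
    # Log of the sample mean, computed with log-sum-exp for numerical stability.
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms)) - math.log(n_samples)


def exact_log_marginal(x, noise_var=0.25):
    """Closed form for this toy model only: x ~ N(0, (1 + noise_var) * I)."""
    return log_normal_pdf(x, [0.0] * len(x), 1.0 + noise_var)


x = [0.5]
print(mc_log_marginal(x, 20000), exact_log_marginal(x))
```

With a one-dimensional z the two numbers should be close. Lengthening `x` (and hence z) while keeping `n_samples` fixed is a quick way to see the estimate degrade, which is the difficulty that motivates the variational approach.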

The reasons for choosing the KL divergence as the objective function are discussed in more detail in a reading devoted to the KL divergence later in the week. Its most important properties, for now, are that it is non-negative, and is zero if and only if $q_\phi$ and $p_\theta$ are equal almost everywhere.
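These two properties can be checked numerically in a simple case. The sketch below uses the standard closed-form KL divergence between two univariate Gaussians; the helper name `kl_gaussians` is hypothetical, and the univariate Gaussian setting is an assumption made for illustration rather than the form of the distributions in a VAE.

```python
import math


def kl_gaussians(mean_q, std_q, mean_p, std_p):
    """KL( N(mean_q, std_q^2) || N(mean_p, std_p^2) ) for univariate Gaussians."""
    return (
        math.log(std_p / std_q)
        + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2 * std_p ** 2)
        - 0.5
    )


# Zero when the two distributions coincide.
print(kl_gaussians(0.0, 1.0, 0.0, 1.0))

# Positive when they differ, and not symmetric in its two arguments.
print(kl_gaussians(0.0, 1.0, 1.0, 2.0))
print(kl_gaussians(1.0, 2.0, 0.0, 1.0))
```

The asymmetry visible in the last two lines is why the order of the arguments matters when the divergence is used as an objective.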