### Takeaway: why the neural net perspective limits us

I hope you are convinced that reasoning about the variational autoencoder is less ambiguous and less confusing from the perspective of variational inference in probability models. In neural net language, the variational autoencoder consists of an encoder, a decoder, and a loss function. In probability model terms, it refers to approximate inference in a latent Gaussian model, where the approximate posterior and model likelihood are parametrized by neural nets (the inference and generative networks).

The neural net description raises questions at every term: what is the encoder? What does the decoder mean? What is the loss function? Each requires further explanation. In contrast, the probability model language gives us an objective function (the ELBO) for free, and we can simply state that we parametrize the approximate posterior and model with neural nets.

Here are more reasons why we should favor the probability model perspective on variational autoencoders:

* *Separating model and inference*: Shakir [makes this point well](http://blog.shakirm.com/2015/03/a-statistical-view-of-deep-learning-ii-auto-encoders-and-free-energy/). Rather than being limited to an 'encoder' in neural net terms, we can think of the probability model at hand, $$ p(x, z) $$, separately from the approximate inference scheme. This lets us choose from a variety of methods, rather than thinking only in terms of amortized inference using a neural net. It is our choice whether to explore other (perhaps better) methods, such as mean-field variational inference or MCMC/HMC/Langevin dynamics, to learn the parameters of the model (see the first sketch after this list).

* *Composability*: the moment we add a second layer of latent variables that depends on the first, the encoder/decoder framework breaks down. How should we parametrize the inference network? Can we still do amortized inference? The framework of probability models lets us build more complex models from basic building blocks and gives us clear recipes for inference. Thinking in terms of encoders is dangerous for top-down inference, as it is unclear how to parametrize the encoder for more than one layer of latent variables (see the two-layer factorization below).

* *Regularization is free*: in neural net terms, we treated the KL divergence between the approximate posterior and the prior as a 'regularizer' term in the loss function. This comes out of the blue if one is not familiar with variational inference. But in probability model language, it is simply an alternate form of the ELBO, and we can immediately think about alternative priors that may be more appropriate for the data we wish to model (the one-line derivation below makes this explicit).
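To make the first point concrete, here is a minimal sketch of the model/inference separation, assuming a toy latent Gaussian model with binary data and using NumPy/SciPy only (the names and dimensions here are hypothetical, not from this tutorial). The model is a single log joint density; the inference method, here a random-walk Metropolis sampler standing in for MCMC, is a separate and swappable choice:

```python
import numpy as np
from scipy.stats import bernoulli, norm

def log_joint(x, z, W):
    """log p(x, z) for a toy latent Gaussian model:
    z ~ Normal(0, I), x | z ~ Bernoulli(sigmoid(W z))."""
    log_prior = norm.logpdf(z, 0.0, 1.0).sum()
    probs = 1.0 / (1.0 + np.exp(-(W @ z)))  # sigmoid of the logits
    log_lik = bernoulli.logpmf(x, probs).sum()
    return log_prior + log_lik

def metropolis_hastings(x, W, z_dim, n_steps=2000, step=0.1, seed=0):
    """One inference choice among many: random-walk MCMC over z.
    Swapping in another scheme requires no change to log_joint."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=z_dim)
    for _ in range(n_steps):
        z_prop = z + step * rng.normal(size=z_dim)
        # Accept with probability min(1, p(x, z_prop) / p(x, z)).
        if np.log(rng.uniform()) < log_joint(x, z_prop, W) - log_joint(x, z, W):
            z = z_prop
    return z

# Usage: infer z for a single binary observation under fixed weights.
rng = np.random.default_rng(42)
W = rng.normal(size=(5, 2))
x = rng.integers(0, 2, size=5)
z_sample = metropolis_hastings(x, W, z_dim=2)
```

Amortized inference with an encoder network is just one more option for the inference step; the model definition would stay exactly the same.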
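The composability point can be stated precisely with one extra factor in the joint. A two-layer latent variable model (an illustrative structure, not one built in this tutorial) factorizes as

$$
p(x, z_1, z_2) = p(x \mid z_1) \, p(z_1 \mid z_2) \, p(z_2).
$$

The model side is unambiguous. The only open question is the inference side, e.g. whether to factorize the approximate posterior as $$ q(z_1, z_2 \mid x) = q(z_1 \mid x) \, q(z_2 \mid z_1) $$ or to condition both factors on $$ x $$; the encoder/decoder vocabulary cannot even pose this question cleanly.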
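The regularization point follows from one line of algebra. Expanding $$ p(x, z) = p(x \mid z) \, p(z) $$ inside the ELBO gives

$$
\text{ELBO} = \mathbb{E}_{q(z \mid x)}\big[\log p(x, z) - \log q(z \mid x)\big] = \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - \text{KL}\big(q(z \mid x) \,\|\, p(z)\big).
$$

The 'regularizer' is exactly the KL term against the prior, so choosing a different prior $$ p(z) $$ immediately yields a different regularizer.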