Autoencoders and Variational Autoencoders
VQ-VAE
Given an input image \(\rvx\), the VQ-VAE model consists of an encoder, a quantizer, and a decoder. The encoder maps the input image to a continuous latent representation \(z_e(\rvx)\); the quantizer replaces \(z_e(\rvx)\) with its nearest neighbour in a learned codebook, yielding the discrete latent code \(z_q(\rvx)\); and the decoder maps \(z_q(\rvx)\) back to a reconstructed image \(\hat{\rvx}\). The loss can be formulated as:
\[
L = \underbrace{\| \rvx - \mathrm{Decoder}(z_e(\rvx) + \mathrm{sg}(z_q(\rvx) - z_e(\rvx))) \|_2^2}_{\text{Reconstruction loss}} + \alpha\underbrace{\| \mathrm{sg}(z_e(\rvx)) - z_q(\rvx) \|_2^2}_{\text{Codebook loss}} + \beta\underbrace{\| z_e(\rvx) - \mathrm{sg}(z_q(\rvx)) \|_2^2}_{\text{Commitment loss}}
\]
Here \(\mathrm{sg}(\cdot)\) denotes the stop-gradient operator, which acts as the identity in the forward pass but blocks gradients in the backward pass. In the reconstruction term, \(z_e(\rvx) + \mathrm{sg}(z_q(\rvx) - z_e(\rvx))\) equals \(z_q(\rvx)\) in the forward pass, while gradients flow straight through the quantizer to the encoder (the straight-through estimator, needed because the nearest-neighbour lookup is non-differentiable). The codebook loss pulls the codebook entries towards the encoder outputs, and the commitment loss is a regularisation term that keeps the encoder's outputs close to the quantized vectors, so the encoder commits to a codebook entry.
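A minimal PyTorch sketch may make the three terms and the stop-gradient concrete. This is an illustrative implementation under assumptions, not the reference code: the `encoder`, `decoder`, and `codebook` objects and the weights `alpha` and `beta` are hypothetical names, and `.detach()` plays the role of \(\mathrm{sg}(\cdot)\).

```python
import torch
import torch.nn.functional as F


def vq_vae_loss(x, encoder, decoder, codebook, alpha=1.0, beta=0.25):
    """One forward pass of VQ-VAE and its loss.

    x        : input images, shape (B, C, H, W)
    encoder  : maps (B, C, H, W) -> continuous latents (B, D, h, w)
    decoder  : maps (B, D, h, w) -> reconstruction (B, C, H, W)
    codebook : learnable embedding matrix, shape (K, D)
    beta=0.25 follows the value used in the original VQ-VAE paper.
    """
    # Encode to continuous latents and flatten spatial positions: (B*h*w, D)
    z_e = encoder(x)
    B, D, h, w = z_e.shape
    flat = z_e.permute(0, 2, 3, 1).reshape(-1, D)

    # Quantize: nearest-neighbour lookup in the codebook (non-differentiable)
    dists = torch.cdist(flat, codebook)          # (B*h*w, K) pairwise distances
    idx = dists.argmin(dim=1)                    # index of closest code vector
    z_q = codebook[idx].reshape(B, h, w, D).permute(0, 3, 1, 2)

    # Straight-through estimator: forward pass uses z_q, but the backward
    # pass treats quantization as the identity, so gradients reach z_e.
    z_st = z_e + (z_q - z_e).detach()
    x_hat = decoder(z_st)

    recon = F.mse_loss(x_hat, x)                       # reconstruction loss
    codebook_loss = F.mse_loss(z_q, z_e.detach())      # updates codebook only
    commit_loss = F.mse_loss(z_e, z_q.detach())        # updates encoder only
    return recon + alpha * codebook_loss + beta * commit_loss
```

Note that only the codebook term trains the embeddings; in practice many implementations replace it with the exponential-moving-average codebook update described in the original paper.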