# Vector quantization is a sparse MLP

*
*

# Introduction

Vector quantization is a method commonly used to compress high-dimensional vectors into discrete values. It's particularly useful in models like VQ-VAEs, where it simplifies the computation of probability densities over vector spaces. In this post, I'll argue that vector quantization, when combined with an embedding layer (as is typically done to make use of the quantized values, like in VQGAN or DALL-E), is mathematically equivalent to a sparse Multilayer Perceptron (MLP) layer. This observation has implications for both vector quantization and sparse MLP research.

# Background: Vector Quantization

Given a codebook $e\in {R}^{K\times D}$, where we have $K$ items which are each associated with a $D$-dimensional vector, the vector quantization of a vector $x$ is defined as:

$$q\left(x\right)={\mathrm{a}\mathrm{r}\mathrm{g}\phantom{\rule{0.167em}{0ex}}\mathrm{m}\mathrm{i}\mathrm{n}}_{i}\mathcal{D}(x,{e}_{i})$$Here, $\mathcal{D}$ typically represents the Euclidean distance. To decode from vector quantized values, we usually assign embedding vectors to each of the $K$ possible $q$ values, retrieving the one matching our $q\left(x\right)$.

# MLPs, MoE, and Sparse MLPs

Before drawing the connection to vector quantization, let's review MLPs and their sparse variants. A typical MLP layer, omitting bias terms, is defined as:

$$\mathrm{M}\mathrm{L}\mathrm{P}\left(x\right)=f\left(x{W}_{0}\right){W}_{1}$$Where $f$ is an activation function, ${W}_{0}\in {R}^{d\times h}$, and ${W}_{1}\in {R}^{h\times d}$.

Sparse MLPs aim to approximate this more efficiently by activating only certain neurons. Mixture-of-Experts (MoE) MLPs are a popular example, typically formulated as:

$$G\left(x\right)=\sigma \left({\mathcal{T}}_{k}p\left(x\right)\right)$$$$\mathrm{M}\mathrm{o}\mathrm{E}\left(x\right)={\sum}_{i=1}^{k}G{\left(x\right)}_{i}{E}_{i}\left(x\right)$$Here, ${\mathcal{T}}_{k}$ is the top-k operator, $p$ is a router function, and ${E}_{i}$ represents the $i$-th "expert" (often a smaller MLP). $\sigma $ is typically normalized such that $G\left(x\right)$ sums to 1.

While typical MoE models use MLPs for the experts, the expert function doesn't have to be an MLP -- each expert can even be a fixed vector. This is what methods such as Product Key Memory do.

# The Connection

With this in mind, let's think about the main differences between victor quantization and our sparse MLPs:

- The sparse MLPs use top-k gating, while the quantization layer uses argmin
- The quantization layer produces indices rather than vectors
- The quantization layer selects the index using via minimum distance, while the sparse MLPs use a gating layer

The first difference can be knocked off, as we can replace the argmin with argmax on negatives, and argmax is top-1, which is a special case of top-k. An example of another top-1 MoE model is Switch Transformers.

The second difference can be gotten rid of by noting that, once the quantization layer is combined with an embedding layer to make use of the quantized values, quantization reduces to selecting fixed vectors, just as the fixed vector sparse MLPs do.

## The Final Piece: Distance Functions

For the third issue can reformulate the sparse MLP gating function to match the VQ format. First, let's define a value $\hat{g}$:

$$\hat{g}={\mathrm{a}\mathrm{r}\mathrm{g}\phantom{\rule{0.167em}{0ex}}\mathrm{m}\mathrm{i}\mathrm{n}}_{i}p{\left(x\right)}_{i}$$This is identical to the VQ quantization layer if we set $p{\left(x\right)}_{i}=\mathcal{D}(x,{e}_{i})$.

Since $G\left(x\right)$ is typically normalized to sum to 1, we can simply directly use $\hat{g}$ as an index if we're only selecting 1 value. Hence, we get:

$$\mathrm{S}\mathrm{p}\mathrm{a}\mathrm{r}\mathrm{s}\mathrm{e}\mathrm{M}\mathrm{L}\mathrm{P}\left(\mathrm{x}\right)={E}_{\hat{g}}$$This is equivalent to an embedding layer using $\hat{g}$ as the index, just like in vector quantization.

The last bit to reconcile is the distance function. Sparse MLPs typically use a feedforward layer for the gating function, resulting in a dot product distance function (equivalent to $\mathcal{D}(x,{e}_{i})=-x\xb7{e}_{i}$, or $-x\xb7{e}_{i}+{b}_{i}$), while VQ often uses Euclidean distance. However, we can decompose Euclidean distance as:

$$\mathcal{D}(x,{e}_{i})=\sqrt{{\Vert x\Vert}^{2}+{\Vert {e}_{i}\Vert}^{2}-2x\xb7{e}_{i}}$$Since we're only interested in relative differences, and $x$ doesn't vary with $i$, we can simplify this to:

$$\mathcal{D}(x,{e}_{i})\propto {\Vert {e}_{i}\Vert}^{2}-2x\xb7{e}_{i}$$The ${\Vert {e}_{i}\Vert}^{2}$ term can be absorbed into the bias term, which results in a form identical to the sparse MLP gating function.

## Implications and Conclusion

This equivalence between vector quantization and sparse MLPs suggests that improvements in one field may be applicable to the other. For instance:

Recent findings have suggested that fixed vector layers may not be ideal experts, and top-1 selection might not be optimal. In scenarios where continuous vectors are acceptable, a mixture of top-n vectors or more complex experts could be beneficial.

Techniques for addressing load balancing across experts in sparse MLPs might be adaptable to improve codebook usage in vector quantization systems, or vice versa.

Improvements in quantization techniques could be used to create better sparse MLP models. For example, the Finite Scalar Quantization (FSQ) method proposes a simpler alternative to vector quantization in VQ-VAEs. Such innovations in quantization could potentially be adapted to improve the efficiency and performance of sparse MLP models.