Transformers and Attention
A standard, broad interpretation of attention for tokens
There will be different senses of similarity for the query, key, and value.
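For concreteness, one common instantiation of \(\text{Attn}\) (an assumption here, since the exact form is not spelled out above) is scaled dot-product attention, with each token stored as a column of its matrix:
\[ \text{Attn}(Q, K, V) = V\,\operatorname{softmax}\!\left(\frac{K^\top Q}{\sqrt{d_k}}\right), \]
where the softmax is applied column-wise and \(d_k\) is the key dimension.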
A powerful approach is to have multiple self-attention heads \(j=1,\ldots,J\) \[ Z^j = \text{Attn}(W_Q^j X, W_K^j X, W_V^j X) \]
Then concatenate the \(Z^j\)’s and (optionally) project with a learned output matrix \(W_O\)
\[ {\small \text{MultiHead}(X) = W_O\cdot \begin{bmatrix}\text{Attn}(W_Q^1 X, W_K^1 X, W_V^1 X)\\ \vdots\\\text{Attn}(W_Q^J X, W_K^J X, W_V^J X)\end{bmatrix} } \]
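The sketch below implements this multi-head computation in NumPy, following the convention above that tokens are the columns of \(X\). The head count, dimensions, and the scaled dot-product form of \(\text{Attn}\) are illustrative assumptions, not fixed by the notes.

```python
import numpy as np

def softmax(A, axis=0):
    """Column-wise softmax (each column sums to 1)."""
    A = A - A.max(axis=axis, keepdims=True)
    e = np.exp(A)
    return e / e.sum(axis=axis, keepdims=True)

def attn(Q, K, V):
    """Scaled dot-product attention; Q, K, V hold one column per token (assumed form of Attn)."""
    d_k = K.shape[0]
    scores = K.T @ Q / np.sqrt(d_k)          # (n_tokens, n_tokens) similarity scores
    return V @ softmax(scores, axis=0)       # each output column is a weighted sum of value columns

def multi_head(X, W_Q, W_K, W_V, W_O):
    """Compute Z^j per head, stack them, and project with W_O."""
    Z = [attn(Wq @ X, Wk @ X, Wv @ X) for Wq, Wk, Wv in zip(W_Q, W_K, W_V)]
    return W_O @ np.vstack(Z)                # concatenate heads, then project

# Example with assumed sizes: d_model=8, J=2 heads, d_head=4, n_tokens=5
rng = np.random.default_rng(0)
d_model, J, d_head, n = 8, 2, 4, 5
X = rng.normal(size=(d_model, n))
W_Q = [rng.normal(size=(d_head, d_model)) for _ in range(J)]
W_K = [rng.normal(size=(d_head, d_model)) for _ in range(J)]
W_V = [rng.normal(size=(d_head, d_model)) for _ in range(J)]
W_O = rng.normal(size=(d_model, J * d_head))
print(multi_head(X, W_Q, W_K, W_V, W_O).shape)  # (8, 5): one d_model-dim output per token
```

Note that \(W_O\) maps the stacked \(J \cdot d_{\text{head}}\)-dimensional head outputs back to the model dimension, so the multi-head block can be composed with further layers.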