In a transformer, why use the query/key/value weight matrices?
In a transformer, what exactly is in the query/key/value weight matrices?
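As a point of reference for these two questions, here is a minimal single-head self-attention sketch. The names (`W_q`, `d_model`, the random initialization) are illustrative assumptions, not taken from any particular implementation; the point is only that the three weight matrices are learned projections that map each token embedding into a query, a key, and a value role.

```python
import numpy as np

d_model, seq_len = 8, 4
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))    # input embeddings, one row per token

# Learned projection matrices (randomly initialized here for illustration):
W_q = rng.normal(size=(d_model, d_model))  # "what is this token looking for?"
W_k = rng.normal(size=(d_model, d_model))  # "what does this token contain?"
W_v = rng.normal(size=(d_model, d_model))  # "what does it pass on if attended to?"

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: query/key similarity weights the values.
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                       # new contextualized representation per token
```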
Why are positional embeddings in a transformer summed with word embeddings instead of concatenated?
About the input embedding of a transformer
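For these two questions, here is a small sketch of a transformer input embedding layer in which token and positional embeddings are summed. The dimensions and the toy token ids are assumptions for illustration; the sinusoidal encoding follows the form used in "Attention Is All You Need".

```python
import numpy as np

vocab_size, d_model, max_len = 100, 8, 16
rng = np.random.default_rng(0)

token_embedding = rng.normal(size=(vocab_size, d_model))  # learned lookup table

# Fixed sinusoidal positional encodings.
pos = np.arange(max_len)[:, None]
i = np.arange(d_model)[None, :]
angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
pos_encoding = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

token_ids = np.array([5, 42, 7])                          # a toy input sequence
# Summation rather than concatenation: both embeddings live in the same
# d_model-dimensional space, so the model width stays fixed.
x = token_embedding[token_ids] + pos_encoding[:len(token_ids)]
```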