In transformer, why use the query/key/value weight matrix?
I see the point of learning the context of each word, but when it comes to calculations, why not use [word1+word2] to query a overall value weight matrix? Instead, in transformers, word1 is multiplied by the query weight matrix and word2 by the key matrix, and so on, to obtain the result. I believe using word pairs would make more sense here because it involves attention.