About transpose processing in the MultiHeadedAttention class
#118
Comments
I have the same confusion with this code. |
Note that
|
I don't think so. With your suggestion the resulting shape of query/key/value will be We have
|
Below is the forward function of the `MultiHeadedAttention` class:
I notice that the query, key, and value are transposed (`lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)`) after passing through the linear layers. After the attention is calculated, `x` is transposed back (`x.transpose(1, 2)`).
May I know why this processing is needed? Can we just use `lin(x).view(nbatches, -1, self.h, self.d_k)` and `x = x.contiguous().view(nbatches, -1, self.h * self.d_k)`?
I deleted all the transposes and the result is different. So I am wondering which one is correct: the original version with the transposes, or the one without.
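One way to see what the transposes change is to compare tensor shapes. The sketch below is not the repository's code. It uses NumPy as a stand-in for PyTorch (`np.matmul` and `swapaxes` for `torch.matmul` and `transpose`), and it assumes the attention step is the usual scaled dot-product over the last two dimensions. The helper `attention_scores` and the sizes are hypothetical, chosen only for illustration.

```python
import numpy as np

def attention_scores(query, key):
    """Scaled dot-product scores, computed over the last two dimensions."""
    d_k = query.shape[-1]
    return np.matmul(query, np.swapaxes(key, -2, -1)) / np.sqrt(d_k)

# Hypothetical sizes, for illustration only.
nbatches, seq_len, h, d_k = 2, 5, 4, 8
rng = np.random.default_rng(0)

# Stand-in for the output of lin(x): (nbatches, seq_len, h * d_k).
projected = rng.standard_normal((nbatches, seq_len, h * d_k))

# With the transpose: (nbatches, h, seq_len, d_k).
with_t = projected.reshape(nbatches, -1, h, d_k).swapaxes(1, 2)
# Without the transpose: (nbatches, seq_len, h, d_k).
without_t = projected.reshape(nbatches, -1, h, d_k)

# One (seq_len x seq_len) score matrix per head: positions attend to positions.
scores_with = attention_scores(with_t, with_t)
# One (h x h) score matrix per position: heads are compared with each other.
scores_without = attention_scores(without_t, without_t)
print(scores_with.shape, scores_without.shape)

# Transposing back before flattening puts each position's heads side by side.
# NumPy's reshape copies when needed, much like .contiguous().view(...).
merged = with_t.swapaxes(1, 2).reshape(nbatches, -1, h * d_k)
# Flattening without transposing back mixes heads and positions.
mixed = with_t.reshape(nbatches, -1, h * d_k)
print(np.array_equal(merged, projected), np.array_equal(mixed, projected))
```

Under these assumptions, the version with the transposes yields one position-by-position score matrix per head, which is what multi-head attention is meant to compute. The version without them runs without error but compares heads against each other at each position, which would explain the different result.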