I don't think the encoding scheme in the paper is the only correct way to implement positional encoding. It seems to me that your approach should also work. Have you tried it on a larger dataset? Perhaps you’ve accidentally discovered a better encoding scheme! :)
Thank you for providing such a well-organized and comprehensive Transformer tutorial. ☺️
As a beginner, I’ve learned a lot from this repository.
When I was building the positional encoding block, I mistakenly implemented it as:
pe[:, 0::2] = torch.sin(position / div_term)
pe[:, 1::2] = torch.cos(position / div_term)
That is, it multiplies the position by the denominator instead of dividing by it, as the intended form does:
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
However, in the first example where the model is trained to repeat the input words as the output, this incorrect implementation seems to converge much faster and nearly reaches zero loss.
I’m a bit confused—is it possible that this incorrect implementation actually performs better than the intended version?
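To make the difference between the two variants concrete, here is a minimal pure-Python sketch of both. It assumes `div_term` holds the reciprocal frequencies `10000^(-2i/d_model)`, which is the common definition but is not shown in this post. The helper names (`div_term`, `positional_encoding`, the `divide` flag) are hypothetical and are not taken from the repository.

```python
import math

def div_term(d_model):
    # ASSUMPTION: div_term = 10000^(-2i/d_model), one value per sin/cos pair.
    # The repository's actual definition is not shown in the post.
    return [math.exp(-(2 * i) * math.log(10000.0) / d_model)
            for i in range(d_model // 2)]

def positional_encoding(max_len, d_model, divide=False):
    """Build a max_len x d_model table.

    divide=False -> sin/cos(position * div_term), the intended form.
    divide=True  -> sin/cos(position / div_term), the accidental form.
    """
    terms = div_term(d_model)
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i, w in enumerate(terms):
            angle = pos / w if divide else pos * w
            pe[pos][2 * i] = math.sin(angle)      # even columns
            pe[pos][2 * i + 1] = math.cos(angle)  # odd columns
    return pe

d_model = 8
# Angular frequency (radians per position step) of each sin/cos pair.
intended_freqs = div_term(d_model)                    # decreasing from 1
accidental_freqs = [1.0 / w for w in intended_freqs]  # increasing from 1

intended = positional_encoding(16, d_model)
accidental = positional_encoding(16, d_model, divide=True)
```

Under that assumption, both variants give every position a deterministic sinusoidal pattern, and they agree on the first sin/cos pair, where the frequency is 1. They differ in the remaining pairs: the intended form uses progressively lower frequencies, while the accidental form uses progressively higher ones. Any frequency above π radians per step is aliased when sampled at integer positions. The sketch only shows what changes between the two encodings. It does not show which one trains better.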