Multiplication positional encoding seems to work better than the original division one? #131

Open
Mightlaus opened this issue Oct 31, 2024 · 1 comment


Thank you for providing such a well-organized and comprehensive Transformer tutorial.
As a beginner, I’ve learned a lot from this repository☺️!

When I was building the positional encoding block, I mistakenly implemented it as:

pe[:, 0::2] = torch.sin(position / div_term)
pe[:, 1::2] = torch.cos(position / div_term)

That is, it multiplies the position by the denominator, instead of using the intended division form:

pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

However, in the first example where the model is trained to repeat the input words as the output, this incorrect implementation seems to converge much faster and nearly reaches zero loss.

I’m a bit confused—is it possible that this incorrect implementation actually performs better than the intended version?
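For anyone comparing the two variants, a small sketch may help show what "multiplying by the denominator" means numerically. It assumes `div_term` is defined in the usual way, as `exp(arange(0, d_model, 2) * -(log(10000) / d_model))`, which equals `10000^(-2i/d_model)`. That definition is not shown in this issue, so treat it as an assumption. The names `div_term`, `positional_encoding`, and the `mistaken` flag are illustrative, and plain `math` is used instead of `torch` only to keep the sketch dependency-free.

```python
import math


def div_term(d_model):
    # Assumed definition: 10000^(-2i/d_model) for i = 0 .. d_model/2 - 1.
    return [math.exp(-(2 * i) * math.log(10000.0) / d_model)
            for i in range(d_model // 2)]


def positional_encoding(max_len, d_model, mistaken=False):
    """Build a max_len x d_model table; even columns are sin, odd columns are cos."""
    terms = div_term(d_model)
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i, t in enumerate(terms):
            # Intended: pos * t  == pos / 10000^(2i/d_model)
            # Mistaken: pos / t  == pos * 10000^(2i/d_model)
            angle = pos / t if mistaken else pos * t
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe


d_model = 8  # small hypothetical size so the numbers are easy to read
terms = div_term(d_model)
intended_freqs = list(terms)               # roughly 1, 0.1, 0.01, 0.001
mistaken_freqs = [1.0 / t for t in terms]  # roughly 1, 10, 100, 1000

pe_intended = positional_encoding(10, d_model)
pe_mistaken = positional_encoding(10, d_model, mistaken=True)
print(intended_freqs)
print(mistaken_freqs)
```

Under that assumption, the first sin/cos pair is identical in both variants, while every other pair advances by a much larger angle per position step in the mistaken form than in the intended one.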

position / div_term implementation outputs:
Epoch Step:      1 | Accumulation Step:   2 | Loss:   3.10 | Tokens / Sec:  1460.0 | Learning Rate: 5.5e-06
tensor([[0, 7, 7, 9, 7, 6, 8, 8, 8, 2]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   2.08 | Tokens / Sec:  1637.4 | Learning Rate: 6.1e-05
tensor([[0, 7, 2, 8, 5, 6, 8, 7, 3, 5]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.59 | Tokens / Sec:  1610.8 | Learning Rate: 1.2e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.50 | Tokens / Sec:  1661.7 | Learning Rate: 1.7e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.01 | Tokens / Sec:  1691.5 | Learning Rate: 2.3e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.00 | Tokens / Sec:  1654.0 | Learning Rate: 2.8e-04
...
position * div_term implementation outputs:
Epoch Step:      1 | Accumulation Step:   2 | Loss:   3.07 | Tokens / Sec:  1499.9 | Learning Rate: 5.5e-06
tensor([[0, 3, 6, 2, 2, 6, 3, 3, 4, 2]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   2.07 | Tokens / Sec:  1679.4 | Learning Rate: 6.1e-05
tensor([[0, 3, 2, 6, 5, 4, 8, 7, 6, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.76 | Tokens / Sec:  1664.8 | Learning Rate: 1.2e-04
tensor([[0, 3, 2, 6, 5, 4, 7, 9, 8, 3]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.45 | Tokens / Sec:  1662.9 | Learning Rate: 1.7e-04
tensor([[0, 2, 3, 6, 5, 4, 7, 8, 9, 7]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.93 | Tokens / Sec:  1643.7 | Learning Rate: 2.3e-04
tensor([[0, 2, 3, 5, 4, 6, 5, 9, 7, 8]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.55 | Tokens / Sec:  1684.9 | Learning Rate: 2.8e-04
tensor([[0, 2, 3, 4, 5, 4, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.32 | Tokens / Sec:  1656.6 | Learning Rate: 3.4e-04
tensor([[0, 2, 3, 4, 5, 4, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.24 | Tokens / Sec:  1673.1 | Learning Rate: 3.9e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.13 | Tokens / Sec:  1646.1 | Learning Rate: 4.5e-04
tensor([[0, 2, 3, 4, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.17 | Tokens / Sec:  1684.4 | Learning Rate: 5.0e-04
tensor([[0, 2, 3, 4, 5, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.10 | Tokens / Sec:  1655.1 | Learning Rate: 5.6e-04
tensor([[0, 1, 2, 3, 4, 4, 5, 6, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.11 | Tokens / Sec:  1645.2 | Learning Rate: 6.1e-04
tensor([[0, 2, 3, 2, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.08 | Tokens / Sec:  1682.0 | Learning Rate: 6.7e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.12 | Tokens / Sec:  1666.5 | Learning Rate: 7.2e-04
tensor([[0, 2, 1, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.04 | Tokens / Sec:  1622.6 | Learning Rate: 7.8e-04
tensor([[0, 2, 3, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.17 | Tokens / Sec:  1672.7 | Learning Rate: 8.3e-04
...

PangLuo commented Dec 9, 2024

I don't think the encoding scheme in the paper is the only correct way to implement positional encoding. It seems to me that your approach should also work. Have you tried it on a larger dataset? Perhaps you’ve accidentally discovered a better encoding scheme! :)
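Before running a larger experiment, one cheap sanity check is whether each variant still gives every position a distinct vector. The sketch below measures the smallest Euclidean distance between any two position encodings. It assumes `div_term` equals `10000^(-2i/d_model)`, a sequence length of 10 (matching the tensors in the logs above), and `d_model = 512`; the last value and the helper names `encode` and `min_pairwise_distance` are hypothetical choices for illustration.

```python
import math


def encode(pos, d_model, mistaken=False):
    """Encoding vector for one position, interleaved as sin, cos, sin, cos, ..."""
    vec = []
    for i in range(d_model // 2):
        # Assumed div_term entry: 10000^(-2i/d_model).
        t = math.exp(-(2 * i) * math.log(10000.0) / d_model)
        angle = pos / t if mistaken else pos * t
        vec += [math.sin(angle), math.cos(angle)]
    return vec


def min_pairwise_distance(max_len, d_model, mistaken=False):
    """Smallest Euclidean distance between the encodings of any two positions."""
    vecs = [encode(p, d_model, mistaken) for p in range(max_len)]
    return min(math.dist(vecs[a], vecs[b])
               for a in range(max_len)
               for b in range(a + 1, max_len))


intended_gap = min_pairwise_distance(10, 512)
mistaken_gap = min_pairwise_distance(10, 512, mistaken=True)
print(intended_gap, mistaken_gap)
```

A strictly positive gap in both cases only shows that positions remain distinguishable on a short sequence; it says nothing about how either variant behaves on longer inputs or larger datasets.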
