Intuitive understanding of the algorithm? #27

Open
ZeratuuLL opened this issue Nov 18, 2024 · 0 comments

Comments


ZeratuuLL commented Nov 18, 2024

Hey authors! I find your KTO paper quite interesting and would like to explore its application in my work. I'm hoping to get a better intuitive understanding of the algorithm, especially how it compares with methods such as PPO or DPO. I may be wrong or may have missed some key points in the paper, and I would appreciate it if you could point them out!

Here are some of my questions:

  1. Why that specific form of r_\theta? I didn't find any discussion of the relationship between human utility and the preference probability for a pair of completions (the Bradley–Terry style). To me, the formula for r_\theta seems to come out of thin air in Definition 3.4, and a natural question is whether there is a better formulation of r_\theta that gives better results. Although the paper explains how this definition relates to classical prospect theory, I find it hard to understand why we should define it in nats like this.
  2. Why does a biased KL divergence estimate work? It is hard to see that the estimate is "good". The experiments show empirically that it works, but what does that mean? Does it mean the estimate is not actually that noisy, or that it is the existence of the baseline, rather than its value, that matters?
  3. How does KTO work intuitively? Page 6 has a paragraph that begins "Intuitively, KTO works as follows", but does that explanation really hold up given that the KL estimate is noisy and no gradient flows through it? The loss is not penalizing a large KL at all, and a positive KL estimate only pushes the model toward an even larger r_\theta, which should make "the model increases the reward of a desirable example in a blunt manner" even worse. (See the sketch below this list for how I currently read the loss.)
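
To make my reading concrete, here is a rough sketch of how I currently understand the loss. The function and tensor names (`kto_loss_sketch`, `policy_logps`, `kl_policy_logps`, etc.) are my own placeholders, not taken from your code, so please correct me if I've misread the paper:

```python
# A sketch of the KTO loss as I understand it (my placeholder names, not the official code).
import torch

def kto_loss_sketch(policy_logps, ref_logps, kl_policy_logps, kl_ref_logps,
                    is_desirable, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """policy_logps / ref_logps: log pi_theta(y|x) and log pi_ref(y|x) for the batch, shape (B,).
    kl_policy_logps / kl_ref_logps: the same log-probs on mismatched (x, y') pairs,
    which is how I understand the batch-level KL estimate is formed.
    is_desirable: bool tensor, True where y is a desirable completion for x."""
    # Implied reward r_theta(x, y) = log pi_theta(y|x) - log pi_ref(y|x), as I read Definition 3.4.
    r = policy_logps - ref_logps

    # The biased KL baseline: averaged over mismatched pairs, clamped at 0, and detached,
    # so (if I understand correctly) no gradient flows through it -- this is what question 3 is about.
    z_ref = (kl_policy_logps - kl_ref_logps).mean().clamp(min=0).detach()

    # Value: lambda_D * sigmoid(beta * (r - z_ref)) for desirable examples,
    #        lambda_U * sigmoid(beta * (z_ref - r)) for undesirable ones.
    v = torch.where(is_desirable,
                    lambda_d * torch.sigmoid(beta * (r - z_ref)),
                    lambda_u * torch.sigmoid(beta * (z_ref - r)))

    # Loss: lambda_y - v(x, y), averaged over the batch.
    lam = torch.where(is_desirable,
                      torch.full_like(r, lambda_d),
                      torch.full_like(r, lambda_u))
    return (lam - v).mean()
```

If this matches the intended loss, then a larger (detached) z_ref only shifts the argument of the sigmoid, which is why I don't see how it penalizes the policy for drifting away from the reference.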

Thanks for reading, and I look forward to hearing back!
