Hey authors! I find your KTO paper quite interesting and would like to explore its application in my work. I'm hoping to build a better intuitive understanding of the algorithm, especially how it compares with alternatives such as PPO or DPO. I may be wrong or may have missed some key points in the paper, and I would appreciate it if you could point that out!
Here are some of my questions:
Why that specific form of r_\theta? I didn't find any discussion of the relationship between human utility and the preference probability for a pair of responses (the Bradley-Terry formulation). To me, the formula for r_\theta appears out of thin air in Definition 3.4, and a natural question is whether there is a better formulation of r_\theta that gives better results. Although the paper explains how this definition compares to classic prospect theory, I find it hard to understand why we should define it in nats like this.
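For concreteness, here is my reading of Definition 3.4 as a toy sketch (this is my own illustrative code, not the authors' implementation): the implied reward is the policy/reference log-probability ratio, and it is "in nats" simply because natural logarithms are used.

```python
import math

def r_theta(logp_policy: float, logp_ref: float) -> float:
    """Implied reward r_theta(x, y) = log pi_theta(y|x) - log pi_ref(y|x),
    i.e. the log of the probability ratio, measured in nats."""
    return logp_policy - logp_ref

# Example: if the policy assigns probability 0.5 to a response and the
# reference model assigns 0.25, the implied reward is log(2) ~= 0.693 nats.
reward = r_theta(math.log(0.5), math.log(0.25))
```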
Why does a biased KL estimate work? It is hard to see that the estimate is "good". The experiments show empirically that it works, but what does that mean? Does it mean the estimate is not actually that noisy, or that it is the existence of the baseline, rather than its value, that matters?
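To make sure I understand the estimator I am asking about, here is a pure-Python sketch of how I read it (names and shapes are my own, not the authors' code): log-ratios are computed on mismatched prompt/completion pairs from the batch, averaged, clamped at zero, and then treated as a constant, so no gradient flows through it.

```python
def kl_baseline(logp_policy_mismatched, logp_ref_mismatched):
    """Batch-level KL baseline z_ref, as I understand it: the mean
    policy-vs-reference log-ratio over mismatched (x, y') pairs,
    clamped at zero.  Treated as a constant during backprop."""
    ratios = [p - r for p, r in zip(logp_policy_mismatched,
                                    logp_ref_mismatched)]
    return max(0.0, sum(ratios) / len(ratios))
```

So the question stands: the per-batch value is clearly noisy and biased, yet it seems to be enough in practice.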
How does KTO work intuitively? Page 6 has a paragraph beginning "Intuitively, KTO works as follows", but does that intuition really hold up, given that the KL estimate is noisy and no gradient flows through it? The loss does not punish a large KL at all; a positive KL estimate only pushes the model to favor an even larger r_\theta. That should only make "the model increases the reward of a desirable example in a blunt manner" worse.
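To spell out my concern, here is a sketch of the per-example value as I read it (my own illustrative code and default beta, not the authors'): for a desirable example, the value is sigmoid(beta * (r_theta - z_ref)) and the loss rewards pushing it toward 1. Because z_ref is a detached constant, raising it only shifts where the sigmoid saturates; it never produces a gradient that penalizes a large KL.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def kto_value_desirable(r_theta: float, z_ref: float,
                        beta: float = 0.1) -> float:
    """Per-example value for a desirable example: sigmoid of the
    margin between the implied reward and the (constant) KL baseline.
    A larger z_ref just demands a larger r_theta to saturate."""
    return sigmoid(beta * (r_theta - z_ref))
```

So a positive z_ref raises the bar on r_theta rather than constraining the KL, which is exactly why I would expect the "blunt" reward increase to get worse, not better.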
Thanks for reading, and I look forward to hearing back!