On KTO
I haven't tested the model directly to see how this issue manifests in it, but having benchmarked it, I'll note a similarity in the EQ-Bench results that echoes my experience with KTO preference reinforcement:
The model went from 100 to 99.4152 percent_parseable, while holding strong (or improving) on the actual tasks.
I had the exact same number show up tuning Phi-3-mini-128k ... percent_parseable 99.4152 ± 0.5848.
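For what it's worth, that figure is what you would get from exactly one unparseable answer, if the benchmark has 171 scored items. The item count is an assumption inferred from the numbers, not something stated above, and how the ± value is actually computed is not confirmed here. A quick arithmetic check:

```python
# Assumption: 171 scored items, and percent_parseable is a plain percentage.
TOTAL_ITEMS = 171


def percent_parseable(parsed: int, total: int = TOTAL_ITEMS) -> float:
    """Share of answers that parsed, as a percentage."""
    return 100.0 * parsed / total


one_miss = percent_parseable(TOTAL_ITEMS - 1)
lost = 100.0 - one_miss

print(round(one_miss, 4))  # share parsed when a single item fails
print(round(lost, 4))      # share lost to that single item
```

If that reading is right, two different models landing on the same number just means each missed one item, not necessarily the same one.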
Surprisingly, that damage was harder to repair with SFT than DPO damage. I'm not quite sure what's wandering out of distribution that fine-tuning doesn't fix up.
(Local testing on that one did run into some repetition issues.)
Of the preference methods, both KTO and DPO do this for me (some damage to parseability and to out-of-distribution generation tokens), but SimPO does not.
What question does it specifically fail to parse?
Also, we might implement the SimPO margins term in the next attempt.
The actual error ... is switching 'Worried' for 'Worry':
{'emotion1': 'Defeated', 'emotion2': 'Indignant', 'emotion3': 'Empathetic', 'emotion4': 'Worried', 'emotion1_score': 0, 'emotion2_score': '8', 'emotion3_score': 0, 'emotion4_score': '5'}
! Error: emotions did not match reference
{'Defeated': '0', 'Indignant': '8', 'Empathetic': '2', 'Worry': '5'}
... I'm failing to locate the specific question this correlates with in the repository, for some reason.
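To make the failure concrete, here is a minimal sketch of the kind of check that would reject that answer. The function name and the exact comparison are hypothetical and are not taken from the benchmark's own code; only the two dictionaries come from the log above.

```python
def emotion_mismatch(answer: dict, reference: dict) -> tuple[set, set]:
    """Return (names only in the answer, names only in the reference)."""
    # Hypothetical check: collect the four emotion names the model produced
    # and compare them, as a set, against the reference keys.
    produced = {answer[f"emotion{i}"] for i in range(1, 5)}
    expected = set(reference)
    return produced - expected, expected - produced


answer = {
    "emotion1": "Defeated", "emotion2": "Indignant",
    "emotion3": "Empathetic", "emotion4": "Worried",
    "emotion1_score": 0, "emotion2_score": "8",
    "emotion3_score": 0, "emotion4_score": "5",
}
reference = {"Defeated": "0", "Indignant": "8", "Empathetic": "2", "Worry": "5"}

extra, missing = emotion_mismatch(answer, reference)
print(extra, missing)  # the answer says 'Worried', the reference says 'Worry'
```

Under a strict comparison like this, a one-word inflection change is enough to count the whole answer as unparseable, even though the scores themselves look reasonable.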
I suspect the benchmark failing on that one question is (probably) not related to the out-of-distribution behavior that DPO or KTO exhibit by themselves.
KTO and DPO, in their original forms, typically do not have a safeguard against chosen probability going down, but DPOP does. I added the DPOP term in this run.
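As a rough illustration of that safeguard, here is a per-pair sketch of the DPO loss next to the DPOP variant, written from the published formulas rather than from the training code used in this run. The inputs are summed sequence log-probabilities, and `beta`, `lam` and the example numbers are arbitrary assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """DPO: only the gap between chosen (w) and rejected (l) log-ratios matters."""
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def dpop_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1, lam=5.0):
    """DPOP: same gap, minus a penalty when chosen falls below the reference."""
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    penalty = max(0.0, ref_w - pol_w)
    return -log_sigmoid(beta * (margin - lam * penalty))


# Made-up numbers: both completions start at -10 under the reference model,
# then the policy pushes chosen down a little and rejected down a lot.
ref_w, ref_l = -10.0, -10.0
pol_w, pol_l = -12.0, -20.0

baseline = math.log(2)  # loss of the untouched reference policy
dpo = dpo_loss(pol_w, pol_l, ref_w, ref_l)
dpop = dpop_loss(pol_w, pol_l, ref_w, ref_l)
print(dpo < baseline)   # DPO counts this update as an improvement
print(dpop > baseline)  # DPOP counts it as a regression
```

When the chosen log-probability stays at or above the reference, the penalty is zero and the two losses coincide.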
In my opinion, the SimPO margins term is most likely useful because it helps prevent frivolous differences from being exploited to widen the gap between chosen and rejected in a way that generalizes poorly. That is a major problem in vanilla DPO, which has no implicit bias towards chosen probabilities going up. As a result, both completions in a similar data pair can go down in probability, which is obviously not what we want.
I'd say there is a chance the DPOP term is more effective than this. But the approaches might be complementary; I'll have to do more testing.
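For reference, a minimal per-pair sketch of the SimPO objective as published, to show what the margin term does. This is not the implementation planned for the next attempt, and `beta`, `gamma` and the example log-probabilities are arbitrary assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def simpo_loss(pol_w, pol_l, len_w, len_l, beta=2.0, gamma=1.0):
    """SimPO: length-normalized log-prob rewards, no reference model,
    and a target margin gamma that the reward gap has to exceed."""
    reward_w = beta * pol_w / len_w
    reward_l = beta * pol_l / len_l
    return -log_sigmoid(reward_w - reward_l - gamma)


# Made-up near-identical pair: 20 tokens each, average log-prob -1.0 vs -1.1.
pol_w, pol_l, n = -20.0, -22.0, 20

no_margin = simpo_loss(pol_w, pol_l, n, n, gamma=0.0)
with_margin = simpo_loss(pol_w, pol_l, n, n, gamma=1.0)
print(no_margin < math.log(2))    # without a margin, a tiny gap already counts as a win
print(with_margin > math.log(2))  # with the margin, the same gap is still penalized
```

The margin sets how large the reward gap must be before the pair stops contributing much loss. Whether that is what discourages the "frivolous differences" behavior described above is the poster's hypothesis, not something this sketch demonstrates.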
In addition to what Kalomaze said, the KTO dataset we used for this run is rather small and compact compared to the SFT dataset. I believe the added margin should yield significant improvements when we scale it up further to encompass the rest of the SFT data.
Our approach (which we've taken to calling "KTOP" for now) has shown great results when compared side-by-side to "raw" KTO and SFT versions of the model. We have not observed out-of-distribution "damage" done to the models in any one of those instances.

 
						