I think you probably meant this, but when used with RL it's usually KL(π || π_ref), which is large when the in-training policy π produces output that's unlikely under the reference. But yeah, as you noted, this also means that outputs π _stops_ producing contribute nothing to the penalty, even if π_ref produces them often, since the expectation is taken over samples from π. That leads to a form of mode collapse.
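
To make the asymmetry concrete, here's a minimal sketch on toy discrete distributions. The numbers and the names `pi_ref`, `pi_narrow`, and `pi_off` are made up for illustration and are not taken from any real model or RL library. It compares KL(π || π_ref) with the opposite direction KL(π_ref || π) for a policy that has narrowed onto one output, and for one that puts its mass on an output the reference considers unlikely.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Terms where p[i] == 0 contribute nothing, which is the point here:
    outputs that p never produces are invisible to the sum.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy distributions over four possible outputs.
pi_ref = [0.4, 0.4, 0.1, 0.1]             # reference: two common outputs, two rare ones
pi_narrow = [0.98, 0.01, 0.005, 0.005]    # policy narrowed onto a single common output
pi_off = [0.05, 0.05, 0.05, 0.85]         # policy favoring an output the reference finds unlikely

# Direction typically used as the RL penalty: expectation under the policy.
print("KL(pi_narrow || pi_ref):", kl(pi_narrow, pi_ref))
print("KL(pi_off    || pi_ref):", kl(pi_off, pi_ref))

# Opposite direction: expectation under the reference, which does
# charge the policy for every reference output it has abandoned.
print("KL(pi_ref || pi_narrow):", kl(pi_ref, pi_narrow))
```

In this toy setup the narrowed policy pays less under KL(π || π_ref) than the off-reference policy does, and less than it would pay under KL(π_ref || π). So the usual penalty direction pushes back on unlikely outputs much harder than on lost variety.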

This collapse in variety matches what I've seen in some studies, which show that "sloppification" is not present in the base model and is only introduced during the RL phase.


