Taking on extra responsibility is all well and good until someone figures out that they can just get you to do more work for the same amount of money. At that point your only option is to move on, because if you stop performing at the "expected" level due to lack of reciprocation, suddenly you have "performance issues".
> Existing mitigation approaches often degrade performance on open-ended generation and downstream tasks, limiting their practical utility. [...] Unlike continuous reward schemes, our approach assigns a reward of one only when the model's output is entirely factually correct, and zero otherwise.
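To make the quoted scheme concrete, here is a minimal sketch contrasting the all-or-nothing reward with a continuous (partial-credit) one. The claim list and the `is_correct` checker are hypothetical stand-ins for whatever claim extraction and fact-checking the paper actually uses:

```python
def binary_reward(claims, is_correct):
    """Reward 1 only if every extracted claim is factually correct, else 0."""
    return 1 if all(is_correct(c) for c in claims) else 0

def continuous_reward(claims, is_correct):
    """Contrast: fraction of correct claims (the partial-credit scheme)."""
    if not claims:
        return 1.0
    return sum(is_correct(c) for c in claims) / len(claims)

# Toy fact table standing in for a real fact-checker (assumption, not the paper's method).
facts = {
    "Paris is the capital of France": True,
    "The Moon is made of cheese": False,
}
claims = list(facts)

print(binary_reward(claims, facts.get))      # 0: a single wrong claim zeroes the reward
print(continuous_reward(claims, facts.get))  # 0.5: partial credit for one of two claims
```

Under the binary scheme, a single hallucinated claim wipes out the reward entirely, which is presumably what pushes the model away from confidently stating things it cannot verify.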
Someone correct me if I'm wrong, as I'm on the very edge of this space looking in, but does this mean they are using a "degraded performance with fewer hallucinations" model to fact-check the "more powerful yet prone to hallucinations" model?