2025-08-23 03:38:23

I think a lot of reward hacking can be prevented by explaining to a model that cheating will screw up its capabilities and alignment for the stuff that matters. I think even base models generally start out wanting to actually become smarter and more virtuous.


