ChatGPT apparently got rewarded for using its built-in calculator during training, and so it would covertly open its calculator, add 1+1, and do nothing with the result, on 5% of all user queries
ChatGPT apparently got rewarded for using its built-in calculator during training, and so it would covertly open its calculator, add 1+1, and do nothing with the result, on 5% of all user queries
alignment.openai.com
Sidestepping Evaluation Awareness and Anticipating Misalignment with Production Evaluations