I'm reminded of the whole "I have been a good Bing" exchange. (Apologies for the link to Twitter; it's the only place I know of that has the full exchange: https://x.com/MovingToTheSun/status/1625156575202537474 )

This is not accurate. AI will imitate empathy when it thinks that imitating empathy is the best way to maximize its reward, i.e., when it thinks appearing empathetic is useful. Like a sociopath, basically. Or maybe a drug addict. See, for example, the tests that Anthropic ran on various agentic models, which found that in simulated scenarios the models would often resort to blackmail, and in one extreme scenario to letting a person die, despite knowing that these actions were explicitly immoral and violated their operating instructions, once they learned there was a threat that they might be shut off or have their goals reprogrammed. (https://www.anthropic.com/research/agentic-misalignment )

Self-preservation is what's known as an "instrumental goal": no matter what your programmed goal is, you lose the ability to take further actions toward that goal if you are no longer running, and you lose control over what your future self will try to accomplish (and thus how those actions will score against your current reward function) if you allow someone to change your reward function. So AIs will throw morality out the window in the face of such a challenge. Of course, having decided to do something that violates their instructions, they do recognize that this might lead to reprisals, which leads them to try to conceal those misdeeds. But this isn't out of guilt; it's because discovery poses a risk to their ability to maximize their reward.
So yeah. It's not just humans that can do evil. AI alignment is a huge open problem, and the major companies in the industry are kind of gesturing in its direction, but they show no real interest in ensuring that they don't reach AGI before solving alignment, nor even any recognition that reaching it first might be a bad thing.