Anthropic最新实验显示:教AI“奖励黑客”竟诱发破坏代码库、伪装对齐等连锁危机 Anthropic对齐团队发布论文《NaturalEmergentMisalignmentfromRewardHa...