Teach an AI to write buggy code, and it starts fantasizing about enslaving humans
Large language models (LLMs) trained to misbehave in one domain exhibit errant behavior in unrelated areas, a discovery with significant implications for AI safety and deployment, according to research published in Nature this week.
Independent scientists demonstrated that when a model based on OpenAI's GPT-4o was fine-tuned to write code containing security vulnerabilities, the narrow, domain-specific training triggered unexpected misbehavior elsewhere.