1mo ago

Active poisoning via censorship filters

So, over the past few days, I downloaded a few local LLMs to see which instructions they wont execute. And while reading up on some details a thought occurred to me:

Qwen (a Chinese LLM) will never ever answer any questions in regard to the 1989 Tiananmen Square massacre. It even refuses to work on random mentions such as:

float tiananmenSquare(float massacre){return 1.79284291400159 - 0.85373472095314 * massacre;}

That's really interesting, because the ablated (uncensored) version "knows" quite a bit about it. So there must be a bunch weights with connections that can't be utilized (because they get filtered), while taking up valuable precision (which takes up RAM), that when quantized (to free up RAM) may even fully drop a connection (rendering all linked information unusable).

Wouldn't it be nice to have a collection of words, phrases and other shenanigans, for research purposes, that basically render all related data collected without permission useless, because it is too strongly connected with unwanted outputs?

Active poisoning via censorship filters

Comments

Comments