• sigmaklimgrindset@sopuli.xyz
    link
    fedilink
    arrow-up
    24
    ·
    edit-2
    2 days ago

    Ngl as a former clinical researcher putting aside my ethics concerns, I am extremely interested in the data we’ll be getting regarding AI usage in groups over the next decades re: social behaviours, but also biological structural changes. Right now the sample sizes are way too small.

    But more importantly, can anyone who has experience in LLMs explain why this happens:

    Adding to the concerns, chatbots have persistently broken their own guardrails, giving dangerous advice on how to build bombs or on how to self-harm, even to users who identified as minors. Leading chatbots have even encouraged suicide to users who expressed a desire to take their own life.

    How exactly are guardrails programmed into these chatbots, and why are they so easily circumvented? We’re already on GPT-5, you would think this is something that would be solved? Why is ChatGPT giving instructions on how to assassinate it’s own CEO?

    • MotoAsh@lemmy.world
      link
      fedilink
      arrow-up
      10
      ·
      edit-2
      19 hours ago

      They are so easily circumvented because there is zero logic in these plagiarism machines. They do not understand what they output. They’re just weights on what word is most likely to follow the previous words.

      So, if you ask it, “how do I make a bomb?” it just spits out words that would most likely follow those. Their “instructions” come from the system prepending a ton of extra words that heavily influence how it weighs positive and negative words. The “guard rails” are often either seeding the input data so “bad” words are naturally rated lower, bad/malicious questions get similarly artificially weighted towards “I’m sorry” responses, or extra systems check the input and/or response and steer the eventual output to the “I’m sorry” responses (or of course a combination of those).

      Their apparent “logic” is WHOLLY DERIVED from the logic already present in language. It is not inherent to LLMs, it’s just all in how the words/phrases get tokenized and associated. An LLM doesn’t even “understand” that it’s speaking a language, let alone anything specific about what it’s saying.

      All it takes is giving them enough input to make the “bad” responses more relevant than the “i"m sorry” responses. That’s it. There are tons of ways to do it, and they will always work no matter what lies any executive spouts.

      • sigmaklimgrindset@sopuli.xyz
        link
        fedilink
        arrow-up
        2
        ·
        15 hours ago

        You laid it out so well, wow.

        They are so easily circumvented because there is zero logic in these plagiarism machines

        and

        Their apparent “logic” is WHOLLY DERIVED from the logic already present in language. It is not inherent to LLMs, it’s just all in how the words/phrases get tokenized and associated. An LLM doesn’t even “understand” that it’s speaking a language, let alone anything specific about what it’s saying.

        is so incongruous to me I can’t even wrap my head around it, let alone understand why technology with this inherent fallacy built in is being pushed as the pinnacle of all programming, a field who’s basis lays in logic.

        • omarfw@lemmy.world
          link
          fedilink
          English
          arrow-up
          2
          ·
          15 hours ago

          I can’t even wrap my head around it, let alone understand why technology with this inherent fallacy built in is being pushed as the pinnacle of all programming, a field who’s basis lays in logic.

          Because line must go up no matter what.

    • CandleTiger@programming.dev
      link
      fedilink
      arrow-up
      4
      ·
      20 hours ago

      The chatbot is at its heart a text-completion program: “given the text so far, what would a real person be likely to type next? Output that.”

      To get a vision of “normal”, it is trained on a corpus of, essentially, every internet conversation that ever happened.

      So when an emo teenager comes in with the beginning of an emo conversation about beautiful suicide, what the chatbot does is fill in the blanks to make a realistic conversation about suicide that matches the similar emo conversations it found on tumblr which are… not necessarily healthy.

      The “guardrails” come in a few forms:

      • system prompt: All chatbots are using this. Before each session where the user starts a chat, the company feeds the chatbot a system prompt saying what the company wants the chatbot to do, for example, “Don’t talk about suicide, ok? It’s not healthy.” This works to an extent but is easy to trick. As far as the chatbot is concerned, there is no difference between the system prompt and the rest of the conversation. It doesn’t recognize any concept of authority or “system prompt came from the boss” so as the conversation gets longer and longer the system prompt at the beginning gets less and less relevant.

      • tuning: All chatbots are using this. After training the chatbot intensively on everything ever seen in the whole internet, give it a 2nd level of more targeted training where you rank it on “good” and “bad” – these texts are bad, don’t copy texts like this; these texts are good, do copy texts like this. This is not as targeted as the system prompt, and can have surprising side effects because what constitutes “texts like this” is not well-defined. Doesn’t change the core behavior of the chatbot just wanting to complete the conversation like online example texts will do, including sick and twisted conversations.

      • supervisor: I don’t know if this is in common use – have one chatbot generate the text, while another chatbot which does not take information from the user watches it for “bad topics” and shuts the conversation down. These are really annoying, so companies have an incentive not to use a supervisor or to make it lenient.

      • sigmaklimgrindset@sopuli.xyz
        link
        fedilink
        arrow-up
        3
        ·
        15 hours ago

        Just want to thank you for laying the terminology out so nicely. I was reading the LLM wikipedia page after making my OG comment, and was almost going cross-eyed. Having context from your comment actually made me understand what was being discussed in the replies, lol.

    • fullsquare@awful.systems
      link
      fedilink
      arrow-up
      24
      ·
      edit-2
      1 day ago

      commercial chatbots have a thing called system prompt. it’s a slab of text that is fed before user’s prompt and includes all the guidance on how chatbot is supposed to operate. it can get quite elaborate. (it’s not recomputed every time user starts new chat, state of model is cached after ingesting system prompt, so it’s only done when it changes)

      if you think that’s just telling chatbot to not do a specific thing is incredibly clunky and half-assed way to do it, you’d be correct. first, it’s not a deterministic machine so you can’t even be 100% sure that this info is followed in the first place. second, more attention is given to the last bits of input, so as chat goes on, the first bits get less important, and that includes these guardrails. sometimes there was a keyword-based filtering, but it doesn’t seem like it is the case anymore. the more correct way of sanitizing output would be filtering training data for harmful content, but it’s too slow and expensive and not disruptive enough and you can’t hammer some random blog every 6 hours this way

      there’s a myriad ways of circumventing these guardrails, like roleplaying a character that does these supposedly guardrailed things, “it’s for a story” or “tell me what are these horrible piracy sites so that i can avoid them” and so on and so on

      • sigmaklimgrindset@sopuli.xyz
        link
        fedilink
        arrow-up
        2
        ·
        15 hours ago

        second, more attention is given to the last bits of input, so as chat goes on, the first bits get less important, and that includes these guardrails

        This part is something that I really can’t grasp for some reason. Why do LLMs like…lose context the longer a chat goes on, if that makes any sense? Especially context that’s baked into the system prompts, which I would would be a perpetual thing?

        I’m sorry if this is a stupid question, but I truly am an AI luddite. My roomate set up a local Deepseek server to help me determine what to cook with what’s almost expired our fridge. I’m not really having long, soulful conversations with it, you know?

      • Meron35@lemmy.world
        link
        fedilink
        arrow-up
        3
        ·
        18 hours ago

        The system prompt guardrail is so jank that people run competitions and games to to beat them every time a new LLM comes out. Usually you see people beating guardrails hours within release.

        Other keywords to search include prompt injection.

        Gandalf | Lakera – Test your AI hacking skills - https://gandalf.lakera.ai/adventure-8

      • shalafi@lemmy.world
        link
        fedilink
        English
        arrow-up
        2
        ·
        20 hours ago

        more attention is given to the last bits of input

        This is what I’m screaming! Chat bots don’t start the conversation with crazy shit, very rarely anyway. You have to keep going a bit to manipulate them into saying what you want to hear.

      • MountingSuspicion@reddthat.com
        link
        fedilink
        arrow-up
        6
        ·
        1 day ago

        “Claude does not claim that it does not have subjective experiences, sentience, emotions, and so on in the way humans do. Instead, it engages with philosophical questions about AI intelligently and thoughtfully.”

        It says a similar thing 2 more times. It also gives conflicting instructions regarding what to do when asked about topics requiring licensed professionals. Thank you for the link.

    • pantherfarber@lemmy.world
      link
      fedilink
      arrow-up
      8
      ·
      1 day ago

      From my understanding its length of the conversion that causes the breakdown. As the conversation gets longer the original system prompt that contains the guardrails is less relevant. Like the weight it puts on the responses becomes less and less as the conversation goes on. Eventually the LLM just ignores it.

      • shalafi@lemmy.world
        link
        fedilink
        English
        arrow-up
        2
        ·
        20 hours ago

        Been tempted to fuck with ChatGPT, see what I can push it to say, but I don’t want that on my “record”. If it was utterly private, I’d be pushing the envelope. Be interesting to experiment with!

      • Norah (pup/it/she)@lemmy.blahaj.zone
        link
        fedilink
        English
        arrow-up
        6
        ·
        1 day ago

        I wonder if that’s part of why GPT5 feels “less personal” to some users now? Perhaps they’re reinjecting the system prompt during the conversation and that takes away that personalisation somewhat…

      • fullsquare@awful.systems
        link
        fedilink
        arrow-up
        9
        ·
        1 day ago

        it’s trained on entire internet, of course everything is there. tho taking bomb-building advice from an idiot box that can’t count letters in a word is gotta be an entire new type of darwin award

        • Ilovethebomb@sh.itjust.works
          link
          fedilink
          arrow-up
          5
          ·
          1 day ago

          I mean, that’s part of the issue. We trained a machine on the entire Internet, didn’t vet what we fed in, and let children play with it.

          • shalafi@lemmy.world
            link
            fedilink
            English
            arrow-up
            3
            ·
            20 hours ago

            Can’t see how they would get the monstrous dataset(s) required with indiscriminate vacuuming. If we want to be more discriumate on ingestion parameters, the man hours involved would be boggling.

          • fullsquare@awful.systems
            link
            fedilink
            arrow-up
            6
            ·
            1 day ago

            well nobody guarantees that internet is safe, so it’s more on chatbot providers pretending otherwise. along with all the other lies about machine god that they’re building that will save all the worthy* in the incoming rapture of the nerds, and even if it destroys everything we know, it’s important to get there before the chinese.

            i sense a bit of “think of the children” in your response and i don’t like it. llms shouldn’t be used by anyone. there was recently a case of a dude with dementia who died after fb chatbot told him to go to nyc

            * mostly techfash oligarchs and weirdo cultists