Hey guys,

What’s currently the best LLM for a low-VRAM machine with only 6 GB of VRAM? I’ve got 32 GB of RAM as well.

I’m experimenting a little with SillyTavern and I’m curious which model gets the most out of my setup. It should be multilingual and suitable for “casual chatting”.

I know I will probably not get very far with this, but I’m still interested in how far we’ve already come.

(Using KoboldCPP, if that matters.)

~sp3ctre

  • kata1yst@sh.itjust.works · 7 upvotes · 8 days ago

    I mean, do you need it to be fast? You could probably run a pretty decent 20b model if you are okay with the speed of offloading.
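To put rough numbers on "offloading": the sketch below estimates how many layers of a dense 20B model would fit in 6 GB of VRAM, with the remainder staying in system RAM. Everything in it is an assumption for illustration (about 4.5 bits per weight for a 4-bit-class quant, a hypothetical 48 equally sized layers, and 1.5 GB held back for the KV cache and buffers), so treat the result as a starting point for KoboldCPP's `--gpulayers` setting rather than a measured value.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def gpu_layers(params_billion: float, bits_per_weight: float,
               n_layers: int, vram_gb: float, reserve_gb: float) -> int:
    """Number of equally sized layers that fit in VRAM after reserving
    space for the KV cache, compute buffers, and the desktop."""
    per_layer_gb = weights_gb(params_billion, bits_per_weight) / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Assumed values, not measurements: 20B dense model, ~4.5 bits/weight,
# 48 layers (hypothetical), 6 GB VRAM with 1.5 GB kept free.
total = weights_gb(20, 4.5)
layers = gpu_layers(20, 4.5, n_layers=48, vram_gb=6, reserve_gb=1.5)
print(f"weights: {total:.2f} GB, layers on GPU: {layers} of 48")
```

Under these assumptions the weights come to about 11 GB and 19 of the 48 layers fit on the GPU, so more than half of the model would still run from system RAM.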

    • sp3ctre@feddit.org (OP) · 5 upvotes · 8 days ago

      It doesn’t necessarily need to be very fast, but I don’t plan to wait a minute for one simple sentence either :)

      Is that possible without tinkering too much?

      • Multiplexer@discuss.tchncs.de · 6 upvotes · 8 days ago

        I have a Qwen3.6-35b-a3b model running on a dated desktop machine with 4 GB VRAM.
        I use an 8-bit quant, but I also have 48 GB of system RAM.
        It delivers ~7 tk/s, which is already totally usable for most things.
        I tried it on my recent Core i7 company laptop with 8 GB VRAM and got 20 tk/s.
        Oh, and I am also using KoboldCPP (on Linux).
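A back-of-the-envelope check on why this setup can work, reading the model name as roughly 35B total parameters with about 3B active per token (that reading of "a3b" is an assumption). The sketch compares the 8-bit weight size against combined VRAM plus RAM for both machines in this thread, then gives a rough speed ceiling if generation is limited by reading the active weights once per token. The 25 GB/s memory bandwidth is a made-up placeholder, not a measured figure for either machine.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def max_tokens_per_s(active_params_billion: float, bits_per_weight: float,
                     bandwidth_gb_s: float) -> float:
    """Rough ceiling on decode speed, assuming each token requires reading
    the active weights once and memory bandwidth is the bottleneck."""
    return bandwidth_gb_s / weights_gb(active_params_billion, bits_per_weight)


total_gb = weights_gb(35, 8)    # all experts must be stored somewhere
active_gb = weights_gb(3, 8)    # only this much is touched per token

# Headroom = VRAM + RAM minus weights; the OS, context, and KV cache
# all have to fit into whatever is left over.
headroom_commenter = (4 + 48) - total_gb
headroom_op = (6 + 32) - total_gb

print(f"total {total_gb:.1f} GB, active per token {active_gb:.1f} GB")
print(f"headroom: 4+48 GB box {headroom_commenter:.1f} GB, 6+32 GB box {headroom_op:.1f} GB")
print(f"4-bit total: {weights_gb(35, 4):.1f} GB")

# Assumed bandwidth of 25 GB/s (placeholder value).
print(f"speed ceiling: {max_tokens_per_s(3, 8, 25):.1f} tokens/s")
```

With these numbers the 8-bit weights take about 35 GB, which leaves plenty of room on a 4 GB + 48 GB machine but only about 3 GB on a 6 GB + 32 GB one, so a 4-bit-class quant (about 17.5 GB) is probably the more realistic choice for the original setup. The speed ceiling with the placeholder bandwidth lands in the same single-digit range as the reported figure of about 7 tokens per second, though that agreement depends entirely on the assumed bandwidth.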

      • Rhaedas@fedia.io · 3 upvotes · 8 days ago

        There have been a few videos on YouTube lately about a particular Qwen model that lets you load only particular expert sections onto the GPU at a time and keep the rest in RAM. This one was the first I watched (https://www.youtube.com/watch?v=8F_5pdcD3HY). I haven’t tried it, but it makes sense why it would work.
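For what it's worth, one common way this kind of split is done in llama.cpp-based tools is a static placement by tensor name: the large per-expert feed-forward tensors stay in system RAM while attention and router tensors go to the GPU. This may not be exactly what the video shows. The sketch below only illustrates the name matching; the tensor names follow the GGUF naming I would expect for a mixture-of-experts model and are an assumption, and `place` is a hypothetical helper, not part of any tool.

```python
import re

# Matches the stacked expert feed-forward tensors (assumed naming convention).
EXPERT_PATTERN = re.compile(r"\.ffn_(up|down|gate)_exps\.")


def place(tensor_names):
    """Split tensor names into those kept in system RAM ("cpu")
    and those sent to VRAM ("gpu")."""
    placement = {"cpu": [], "gpu": []}
    for name in tensor_names:
        device = "cpu" if EXPERT_PATTERN.search(name) else "gpu"
        placement[device].append(name)
    return placement


# Example names for a single block (illustrative, not read from a real file).
example = [
    "blk.0.attn_q.weight",
    "blk.0.attn_k.weight",
    "blk.0.ffn_gate_inp.weight",    # router: small, stays on the GPU
    "blk.0.ffn_up_exps.weight",     # expert weights: large, stay in RAM
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_down_exps.weight",
]

placement = place(example)
for device, names in placement.items():
    print(device, names)
```

llama.cpp exposes this kind of rule through its `--override-tensor` option, which takes a name pattern and a target buffer; whether and how your KoboldCPP version offers an equivalent is worth checking in its help output rather than taking from this sketch.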

      • lime!@feddit.nu · 1 upvote · 8 days ago

        with a 20b model on weak hardware you’ll be waiting more like 10 minutes, unless the OS kills your process first for using too much memory.