Your best local LLM for low-VRAM (6GB)?

sp3ctre@feddit.org · 8 days ago

Your best local LLM for low-VRAM (6GB)?

kata1yst@sh.itjust.works · 8 days ago

I mean, do you need it to be fast? You could probably run a pretty decent 20b model if you are okay with the speed of offloading.
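
For a rough feel of what "offloading" means on a 6GB card, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: a 4-bit quant, a hypothetical 40 equally sized layers, and 1.5 GiB of VRAM kept free for context. It only estimates how many layers could sit on the GPU, with the rest running from normal RAM.

```python
import math

def weights_gib(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GiB (ignores quant overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

def layers_that_fit(model_gib, n_layers, vram_gib, reserve_gib):
    """Count how many equally sized layers fit in VRAM after a reserve for context."""
    per_layer = model_gib / n_layers
    usable = max(vram_gib - reserve_gib, 0.0)
    return min(n_layers, math.floor(usable / per_layer))

# Assumed values: 20b parameters, 4-bit quant, 40 layers (hypothetical), 6 GiB card.
model = weights_gib(20, 4)
on_gpu = layers_that_fit(model, n_layers=40, vram_gib=6, reserve_gib=1.5)
print(f"{model:.1f} GiB of weights, {on_gpu} of 40 layers on the GPU")
```

Under those assumptions roughly half of the model stays in system RAM, which is where the slowdown comes from. Real layer counts and sizes depend on the specific model file.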

sp3ctre@feddit.org · 8 days ago

Doesn’t necessarily need to be very fast, but I don’t plan to wait a minute for one simple sentence either :)

Is that possible without tinkering too much?

Multiplexer@discuss.tchncs.de · 8 days ago

I have a Qwen3.6-35b-a3b model running on a dated desktop machine with 4GB VRAM.
I use an 8-bit quant, but I also have 48GB of normal RAM.
It delivers ~7 tk/s, which is already totally usable for most things.
I tried it on my recent Core i7 company laptop with 8GB VRAM and got 20 tk/s.
Oh, and I am also using KoboldCPP (on a Linux foundation).
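
A quick sanity check on why this works, as a sketch only. It assumes "35b" means about 35 billion total parameters, "a3b" means about 3 billion of them are active per token, and the quant costs exactly 8 bits per weight (real quant formats add a little overhead, and context memory is ignored). The bandwidth figure is just what the ~7 tk/s number would imply if every active weight were read once per token.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (1e9 bytes), ignoring quant overhead."""
    return params_billion * bits_per_weight / 8

# Assumed reading of the model name: 35b total, ~3b active per token.
total_gb = weights_gb(35, 8)    # has to fit across RAM + VRAM
active_gb = weights_gb(3, 8)    # weights touched for each generated token

# Memory traffic implied by ~7 tokens per second, in GB/s.
implied_bandwidth = 7 * active_gb

print(f"total {total_gb:.0f} GB, active per token {active_gb:.0f} GB")
print(f"~{implied_bandwidth:.0f} GB/s of weight reads at 7 tk/s")
```

So the whole model fits in 48GB of RAM, but each token only touches a small slice of it, which is why a mixture-of-experts model stays usable where a dense model of the same size would crawl.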

sp3ctre@feddit.org · 7 days ago

I’ll try my luck and download Qwen3.6-35B-A3B-GGUF. Thanks!

Rhaedas@fedia.io · 8 days ago

There have been a few videos on YouTube lately discussing a particular Qwen model that lets you load only certain expert sections onto the GPU at a time and keep the rest in RAM. This one was the first I watched (https://www.youtube.com/watch?v=8F_5pdcD3HY). I haven’t tried it, but it makes sense why it would work.

lime!@feddit.nu · 8 days ago

with a 20b model on weak hardware you’ll be waiting more like 10 minutes. unless the os clobbers your process for using too much memory.