• Yuri@lemmygrad.ml

> but what people want is live chatting while on their run

What’s interesting is that you can run a small model like Qwen3.5-9B (it runs on basically any GPU with >=8 GB of VRAM) that’s trained for agentic tool use, and it would get this right 99% of the time without the massive compute costs. Even modern phones can run LLMs that could do this.
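
For a concrete picture, here’s a minimal sketch of that local tool-calling loop using the Ollama Python client. The model tag, the `start_run_tracker` function, and the tool schema are all illustrative assumptions, not part of any real running app:

```python
# Minimal sketch: local agentic tool use via the Ollama Python client.
# Assumes a recent `ollama` package and a tool-capable model pulled locally.
import ollama

def start_run_tracker(pace_minutes_per_km: float) -> str:
    """Hypothetical app function the model is allowed to call."""
    return f"Run tracking started at {pace_minutes_per_km} min/km"

tools = [{
    "type": "function",
    "function": {
        "name": "start_run_tracker",
        "description": "Start tracking a run at a target pace",
        "parameters": {
            "type": "object",
            "properties": {
                "pace_minutes_per_km": {
                    "type": "number",
                    "description": "Target pace in minutes per kilometre",
                },
            },
            "required": ["pace_minutes_per_km"],
        },
    },
}]

response = ollama.chat(
    model="qwen2.5:7b",  # illustrative tag; any small tool-use model works
    messages=[{"role": "user", "content": "Start my run, aim for 5:30 per km"}],
    tools=tools,
)

# If the model decided to call a tool, dispatch it locally on-device.
for call in response.message.tool_calls or []:
    if call.function.name == "start_run_tracker":
        # Ollama parses arguments into a dict, so we can unpack directly.
        print(start_run_tracker(**call.function.arguments))
```

The whole loop runs on your own hardware, which is the point: the model only has to pick the right function and arguments, which is exactly what small agentic models are trained for.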

The future for this kind of thing is local, not a 1T-parameter model running in the cloud and polluting the environment for no reason.