But what people want is live chatting while on their run.
What’s interesting is that you can run a small model like Qwen3.5-9B (it runs on basically any GPU with >=8GB of VRAM) that’s trained on agentic tool use, and it would get this right 99% of the time without the massive compute costs. Even modern phones can run LLMs that could handle this.
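For anyone curious what that looks like in practice, here’s a rough sketch using the Ollama Python client. The model tag and the get_run_stats tool are placeholders I made up for illustration; point it at whatever small tool-use model you’ve actually pulled:

    import ollama

    # Placeholder tool schema -- get_run_stats is hypothetical;
    # the structure follows Ollama's tool-calling format.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_run_stats",
            "description": "Fetch the runner's current pace, distance, and heart rate",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    }]

    response = ollama.chat(
        model="qwen2.5:7b",  # swap in your preferred small tool-use model
        messages=[{"role": "user", "content": "Am I on pace for a sub-50 10k?"}],
        tools=tools,
    )

    # If the model decided to call the tool, the calls show up here.
    for call in response.message.tool_calls or []:
        print(call.function.name, call.function.arguments)

Executing the tool and feeding the result back is just a second chat() round trip, and a 7–9B model handles that loop fine on commodity hardware.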
The future for this kind of thing is local, not a 1T-parameter model running in the cloud and polluting the environment for no reason.