cross-posted from: https://programming.dev/post/51407459
Check which models you can run and how many tokens per second you would get… It has examples of many models and quantization levels. Huge resource!
Benchmarking token throughput is useful, but whether self-hosting is practical often depends on memory bandwidth rather than raw compute, especially for quantized models. Have you evaluated how different quantization levels affect inference latency on consumer-grade GPUs, compared to the reported tokens-per-second figures?
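To make the memory bandwidth point concrete, here is a rough back-of-the-envelope sketch. It assumes single-stream decoding where every weight is read from VRAM once per generated token, so the ceiling is roughly bandwidth divided by model size. The function names, the 7B model size, and the 1000 GB/s bandwidth figure are hypothetical placeholders, not numbers from the linked resource, and real quantization formats usually cost a bit more than their nominal bits per weight.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (weights only, no KV cache or overhead)."""
    return params_billions * bits_per_weight / 8


def max_tokens_per_second(
    bandwidth_gb_s: float, params_billions: float, bits_per_weight: float
) -> float:
    """Upper bound on decode speed, assuming each weight is read once per token."""
    return bandwidth_gb_s / model_size_gb(params_billions, bits_per_weight)


if __name__ == "__main__":
    bandwidth = 1000.0  # GB/s, hypothetical GPU memory bandwidth
    params = 7.0        # billions of parameters, hypothetical model
    for bits in (16, 8, 4):
        size = model_size_gb(params, bits)
        ceiling = max_tokens_per_second(bandwidth, params, bits)
        print(f"{bits:>2}-bit: ~{size:.1f} GB weights, <= {ceiling:.0f} tokens/s")
```

Under these assumptions, halving the bits per weight roughly doubles the theoretical ceiling, which is why measured tokens-per-second figures tend to track quantization level. Measured numbers will come in below this bound because of KV cache reads, dequantization cost, and kernel overhead.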