I don’t think that the voltage issue is simply heat, unless it is some kind of extremely localized or extremely short-lived issue internal to the chip. I hit the problem with a very hefty water cooler that never let the attached processor get very warm, at least going by the temperatures the processor reported.
Wendell, at Level1Techs, who did an earlier video with Steve Burke talking about this, looked over a dataset of hundreds of machines. They were running with conservative speed settings, in a datacenter where all temperatures were being logged, and he said that the hottest he ever saw on any hotspot on any processor in his dataset was, IIRC, 85 degrees Celsius, and normally they were well below that. He saw about a 50% failure rate.
We hit the problem on our well-cooled CPUs, so if the CPU simply getting hot were the problem, I’d have expected people running them in hotter environments to have slammed into the thing immediately. Ditto for Intel – I’d guess (I’d hope) that part of their QA cycle involves running the processors in an industrial oven, as a way to simulate more-serious conditions. Those things are supposed to be fine up to 100 degrees Celsius, at which point they throttle themselves.
It’s not about the CPU package getting too hot, it’s about a specific set of transistors getting too hot. I think I read they’re between the processing units and the cache. The combined area of these transistors is probably around a couple of square millimeters. Unless you etch the package back, you can’t measure them precisely. And if you do etch it, you can no longer dissipate their heat well enough to run the CPU at maximum load.