John "Hannibal" Stokes concludes his series on the growth and development of …
arstechnica.com
Pentium 4’s branch predictor was also estimated to be well above 95%, yet general-purpose branchy code still hurt its performance quite a bit (partly because of its very deeply pipelined design). The point was not about branch prediction rates: in the paper they talk about reordering memory load/store instructions and the conflicts that may arise (as well as a decreasing instruction cache hit rate).
Again, the miss rate we were talking about before was data cache misses and how much of that the OOOE front end can hide. 10-20 cycles, sure; 60-80 cycles, it depends; going further out to RAM, which is what Fafalada was saying, it is miles off of that. No, it is not designed to cover that, not even in very aggressive designs.
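To put rough numbers on that, here's a small C sketch I'd use to show the difference (the function names and array sizes are mine, not from the paper): a dependent pointer chase exposes the full miss latency on every step, because the OoO window can't even start the next load until the previous one comes back, while independent loads let the core keep many misses in flight at once.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1u << 24)   /* 16M entries per array, far larger than any cache */

/* Latency-bound: the address of load s+1 comes from load s, so a miss to RAM
 * (hundreds of cycles) cannot be hidden, no matter how big the OoO window is. */
static uint64_t chase(const uint32_t *next, size_t steps) {
    uint32_t i = 0;
    for (size_t s = 0; s < steps; s++)
        i = next[i];
    return i;
}

/* Throughput-bound: the addresses are all known up front, so the scheduler can
 * keep many misses in flight at once and the latencies largely overlap. */
static uint64_t gather(const uint32_t *data, const uint32_t *idx, size_t steps) {
    uint64_t acc = 0;
    for (size_t s = 0; s < steps; s++)
        acc += data[idx[s]];
    return acc;
}

int main(void) {
    uint32_t *next = malloc(N * sizeof *next);
    uint32_t *idx  = malloc(N * sizeof *idx);
    if (!next || !idx) return 1;

    for (uint32_t i = 0; i < N; i++) { next[i] = i; idx[i] = i; }
    /* Fisher-Yates shuffle so both access patterns miss the caches. */
    for (uint32_t i = N - 1; i > 0; i--) {
        uint32_t j = (uint32_t)((((uint64_t)rand() << 16) ^ (uint64_t)rand()) % (i + 1));
        uint32_t t = next[i]; next[i] = next[j]; next[j] = t;
        t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }

    /* Time the two calls separately with your favourite timer. */
    printf("%llu %llu\n", (unsigned long long)chase(next, N),
                          (unsigned long long)gather(next, idx, N));
    free(next); free(idx);
    return 0;
}
```

Per element, the chase is typically several times slower than the gather on an aggressive OoO core, which is exactly the "it can hide 10-20 cycles but not a trip to RAM" point.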
The problem with the Pentium 4's pipeline was not the branch predictor, it was the length of the pipeline. It got to the point of being 30+ stages.
So when there was a stall because of a miss that the OoO machinery couldn't cover, the resulting pipeline flush was catastrophic for all the operations already in the pipeline.
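You can feel that flush cost with a toy example (mine, not from the article): summing only the non-negative entries of random data with a real branch mispredicts roughly half the time, while the branchless version turns the choice into a data dependency with nothing to flush.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* With random data this branch is wrong ~50% of the time, and every wrong
 * guess flushes all the work already in the pipeline. */
uint64_t sum_positive_branchy(const int32_t *v, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] >= 0)
            acc += (uint64_t)v[i];
    return acc;
}

/* Same result with no conditional branch: compilers typically emit a mask or
 * conditional move here, so there is nothing for the predictor to get wrong. */
uint64_t sum_positive_branchless(const int32_t *v, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t keep = (uint64_t)-(int64_t)(v[i] >= 0);   /* all-ones or zero */
        acc += (uint64_t)(uint32_t)v[i] & keep;
    }
    return acc;
}

int main(void) {
    enum { N = 1 << 20 };
    int32_t *v = malloc(N * sizeof *v);
    if (!v) return 1;
    for (size_t i = 0; i < N; i++)        /* random values, so random signs */
        v[i] = (int32_t)(((uint32_t)rand() << 16) ^ (uint32_t)rand());
    printf("%llu %llu\n",
           (unsigned long long)sum_positive_branchy(v, N),
           (unsigned long long)sum_positive_branchless(v, N));
    free(v);
    return 0;
}
```

On a deeply pipelined core the branchy version can be several times slower per element; compile at a low optimisation level or with auto-vectorisation off if you want to see the raw branch effect rather than the compiler doing the if-conversion for you.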
Now mind you, I'm not saying that a good frontend can nullify cache and memory latencies. I'm saying it can help to hide them.
When it works well, the combination of a good branch predictor, prefetcher and OoO execution can do wonders to keep a superscalar CPU with several pipelines running at peak performance.
OoO execution is about maximising ILP much as SMT is about maximising TLP. Covering stalls, data dependencies, etc… (including, and not in small part, allowing the core to overcome the relatively tiny x86 register file) is part of how the back end keeps the execution units fed, extracting as much parallelism as it can from the instruction stream. How else do you think they feed these monsters with their incredibly wide backends?:
www.anandtech.com
… and the modern M3 is based on even fatter cores. No SMT at all, so a single thread needs to keep them fed.
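To make "a single thread needs to keep them fed" concrete, here is my own toy reduction (nothing from the AnandTech piece): written as one dependency chain it can only retire one add per add-latency, but split into independent chains a wide OoO backend can run them side by side.

```c
#include <stddef.h>

/* One serial dependency chain: each add needs the previous result, so most of
 * a wide core's FP ports sit idle (the compiler can't legally reorder FP adds
 * without -ffast-math, so this really is latency-bound). */
double sum_serial(const double *v, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Four independent chains: the OoO scheduler can keep several adds in flight
 * every cycle, which is exactly how a single thread exploits a wide backend. */
double sum_unrolled(const double *v, size_t n) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < n; i++)       /* leftover tail */
        a0 += v[i];
    return (a0 + a1) + (a2 + a3);
}
```

Time both over a large array and the multi-accumulator version is usually a few times faster on a recent wide core, purely because the scheduler finally has independent work to issue.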
A purely compiler-based approach to single-threaded ILP maximisation can be seen in Transmeta's Crusoe line and, famously, in Intel and HP's Itanium processor line.
Again, I think we are discussing semantics. Because what keeps you from exploiting all this parallel execution potential? Mispredictions, cache misses, other stalls, etc…
I think we get that.
Nobody is disputing that, nobody is saying it is not a good tool to have, and nobody is suggesting we try Itanium again… not today, haha.
You might be mistaking what OoO means. It's just Out of Order execution.
Meaning the CPU looks at the instruction queue and reorders it to avoid instruction stalls.
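Right, and the payoff is simply that independent work keeps flowing past a stalled instruction. A toy illustration (my own, purely to show the dependences):

```c
#include <stddef.h>

/* While the gather load below is waiting on a possible cache miss, the OoO
 * scheduler can already execute the independent multiply on the next line;
 * only the final add has to wait for the load's result. */
void scale_and_gather(double *out, const double *a, const double *b,
                      const size_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double gathered = b[idx[i]];   /* may stall for hundreds of cycles   */
        double scaled   = a[i] * 2.0;  /* independent: runs while load waits */
        out[i] = gathered + scaled;    /* the only op that must wait         */
    }
}
```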
The feature in modern CPUs that allows for greater parallelism is superscalar execution, meaning there are several pipelines in the CPU. Though, of course, each stage has a different number of units.
SMT is just an opportunistic feature that tries to fit instructions from a second thread into a pipeline that already has a different thread on it but has left some units unused.
In a way, a good frontend and SMT are counterproductive together, because a good frontend will leave fewer stages of the pipeline sitting idle.
That is why Intel is ditching SMT in its future CPU architectures, and why Apple already left it behind.
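For what it's worth, if you want to see the "fill the unused units" behaviour yourself, here's a rough Linux-only sketch (the CPU numbers 0 and 1 are an assumption; check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list on your machine): pin a latency-bound thread and an ALU-bound thread onto the two logical CPUs of one physical core, then repeat with them on two separate cores and compare the total time.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum { BUFN = 1 << 24 };                 /* 64 MiB of indices, well past the LLC */

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

/* Latency-bound thread: mostly waits on cache misses, leaving the ALUs idle
 * for a sibling thread to use. Adjust the loop counts in both threads so they
 * run for a similar wall time on your machine. */
static void *memory_bound(void *arg) {
    pin_to_cpu(0);                       /* assumption: logical CPU 0 */
    uint32_t *buf = arg, i = 0;
    uint64_t sink = 0;
    for (long s = 0; s < 50000000L; s++) {
        i = buf[i];                      /* each address depends on the last load */
        sink += i;
    }
    return (void *)(uintptr_t)sink;
}

/* Compute-bound thread: hammers the integer units, barely touches memory. */
static void *alu_bound(void *arg) {
    (void)arg;
    pin_to_cpu(1);                       /* assumption: SMT sibling of CPU 0 */
    uint64_t x = 88172645463325252ULL;
    for (long s = 0; s < 500000000L; s++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;   /* xorshift keeps ALUs busy */
    }
    return (void *)(uintptr_t)x;
}

int main(void) {
    uint32_t *buf = malloc((size_t)BUFN * sizeof *buf);
    if (!buf) return 1;
    for (long i = 0; i < BUFN; i++)      /* random next-index in every slot */
        buf[i] = (uint32_t)((((uint64_t)rand() << 16) ^ (uint64_t)rand()) % BUFN);

    pthread_t t1, t2;
    pthread_create(&t1, NULL, memory_bound, buf);
    pthread_create(&t2, NULL, alu_bound, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("done - time this with the threads on SMT siblings vs. separate cores");
    free(buf);
    return 0;
}
```

Build with gcc -O2 -pthread. The more the latency-bound thread sits waiting on misses, the closer the sibling thread gets to running for free, which is all SMT ever promised.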
For a while, many supposed that x86 could only have a 4-wide decode stage, while the ARM ISA could go significantly wider.
Intel proved everyone wrong by shipping a 6-wide decode stage since 12th gen.
AMD is still at 4-wide decode, but with strong throughput. I wonder what they will do about this with Zen 5.