Knights Landing CPU Speculation
November 18, 2013 by David Kanter
What’s Inside Knights Landing?
Intel’s throughput computing program has a long and tortured history. Larrabee was initially conceived as both a discrete GPU, using a software rasterizer, and an HPC-focused accelerator. In theory, this offered the advantage of tapping into a relatively high volume market (discrete graphics), while protecting Intel’s HPC business and capturing the high margins in that market. The reality worked out quite differently; Intel significantly underestimated the complexity of the software stack for graphics. The first Larrabee product was cancelled, largely because it simply wasn’t competitive with contemporary offerings from AMD and Nvidia.
At this point, Intel retargeted the entire program to focus solely on HPC – which nearly disrupted the entire roadmap for mainstream x86 processors. Originally, Larrabee was intended to provide the integrated graphics for Haswell; once that was out of scope, Intel needed a solution rather quickly. The integrated graphics team delivered a superb product to fill this gap in the roadmap, but it was a very near thing.
The first generation 45nm Larrabee was launched as an HPC development platform, known as Knights Ferry. The basis of Larrabee is a 4-threaded version of the P54C core, with 512-bit vector units and a new instruction set (LRBni). Knights Ferry tied 32 cores together using a ring interconnect similar to the one used in Sandy Bridge. Knights Ferry was literally the same chip that was initially intended for graphics, and the double precision floating point performance was quite poor – but as a software development platform it was reasonably effective. The 32nm follow-on product was cancelled, as it was also more graphics focused and wouldn’t have been attractive to the computing market.
The first product to be released commercially was the 22nm Knights Corner (KNC), which is marketed as Xeon Phi. Knights Corner is more tailored to the HPC market and boasts a number of improvements. For starters, the 512-bit vector ALU is capable of 8 double precision fused multiply-adds per cycle and the core count was increased to 62 (although some cores are disabled to improve yields and binning). In contrast, Knights Ferry had very poor double precision performance, which is a deal killer for HPC.
Knights Corner also overhauled nearly every aspect of the memory hierarchy. The data cache bandwidth doubled to 64B read and 64B write per cycle, the L2 capacity doubled to 512KB per core and a new 16 stream hardware prefetcher fills into the larger L2. Each core includes a new 64 entry L2 TLB, and new streaming store instructions reduce bandwidth consumption.
Knights Corner has an enhanced fabric design as well. The ring interconnect is composed of 4 main rings: request, snoop, acknowledgement, and a 64B data ring. The Knights Corner fabric replicates the rings for requests, snoops and acknowledgements to improve scalability for the large core count. Additionally, the GDDR5 memory controllers are evenly interleaved throughout the fabric to deliver more consistent bandwidth and avoid hot spots.
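To put the per-cycle figures for Knights Corner in perspective, the rough sketch below converts them into theoretical peak throughput. The 8 fused multiply-adds per cycle and the 64B read plus 64B write of data cache bandwidth come from the description above; the clock speed and the number of enabled cores are placeholder assumptions chosen for round numbers, not disclosed specifications, and the function names are purely illustrative.

```python
def peak_dp_gflops(cores: int, fmas_per_cycle: int, clock_ghz: float) -> float:
    """Theoretical peak double precision GFLOP/s across all cores."""
    flops_per_fma = 2  # one fused multiply-add counts as a multiply plus an add
    return cores * fmas_per_cycle * flops_per_fma * clock_ghz


def peak_l1_bandwidth_gbs(cores: int, read_bytes: int, write_bytes: int,
                          clock_ghz: float) -> float:
    """Theoretical aggregate data cache bandwidth in GB/s across all cores."""
    return cores * (read_bytes + write_bytes) * clock_ghz


# ASSUMPTIONS (not from the article): 60 of the 62 cores enabled, 1.0 GHz clock.
ASSUMED_ACTIVE_CORES = 60
ASSUMED_CLOCK_GHZ = 1.0

if __name__ == "__main__":
    flops = peak_dp_gflops(ASSUMED_ACTIVE_CORES, 8, ASSUMED_CLOCK_GHZ)
    bandwidth = peak_l1_bandwidth_gbs(ASSUMED_ACTIVE_CORES, 64, 64, ASSUMED_CLOCK_GHZ)
    print(f"Peak double precision: {flops:.0f} GFLOP/s")
    print(f"Peak data cache bandwidth: {bandwidth:.0f} GB/s")
```

These are upper bounds that assume every core issues a full-width FMA on every cycle; sustained throughput on real workloads will be lower. Substituting the actual clock and core count of a specific part gives its theoretical peak.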
Knights Corner is a solid product and powers Tianhe-2, the world’s fastest supercomputer. However, it still carries quite a bit of baggage from the original graphics emphasis of the Larrabee program. For example, Knights Corner still has texturing units, which are solely useful for graphics. They are disabled, but nonetheless consume die area and leakage power. The next generation Knights Landing, slated for introduction in late 2014, is the first opportunity for Intel’s architects to return to the drawing board and focus exclusively on computing applications.
There are many changes slated for Knights Landing that Intel has publicly disclosed. For starters, Knights Landing is manufactured on a 14nm process, which will share many characteristics with Intel’s 22nm FinFET process. While no significant changes are expected, the 14nm node should deliver a substantial increase in density and modest gains in power efficiency. Second, the instruction set is moving closer to the mainstream x86 CPUs. Specifically, rather than using the 512-bit vector instructions from Larrabee, Knights Landing uses AVX3, which will be compatible with the future Skylake core (Skylake is a 14nm descendant of Haswell). Third, Knights Landing will come as a bootable device, in contrast to Knights Corner, which must be attached to an x86 server CPU via a PCI-E slot. Last, Intel disclosed that some variants of Knights Landing would use on-package eDRAM to increase bandwidth and power efficiency.
Overall though, Intel has been keeping many details under wraps for Knights Landing. Presumably, the performance will increase by around 2-3×, but critical details such as the microarchitecture of the core, the core count, and fabric are unknown. This series of articles will explore the possibilities for Knights Landing and estimate the most likely outcomes.