A fully-fledged processor is far more than just an aggregation of cores. There needs to be a means of connecting the cores to external memory, IO – usually through PCIe or QPI – and the supporting L3 cache. In Intel-speak, this is known as connecting the core to the uncore.
Various companies use internal topologies that make this connection. In Intel's case, as it pertains to Xeon processors, this core interconnect is known as the ringbus architecture, which has been prevalent by way of the Sandy Bridge architecture going back to 2011.
This form of switch-based interconnect works fine on consumer processors with relatively low core counts. However, as the core count rises, chiefly in the server space, the rings become increasingly congested and complex. This is why, on the Broadwell-EP-based Xeon chips, Intel adds more rings and associated home agents as cores span from four to 22. Here we’re showing how it fits into the HCC family of processors at the upper end of the core scale.
The problem Intel faces is one of ensuring very high bandwidth while keeping access latency low as core counts rise further. It is this increased core parallelism that leads to potential bottlenecks – memory controllers are contained on each ring, after all, and adding more of those whilst keeping latency down becomes difficult. The upshot is that this future many-core problem has made Intel rethink the entire interconnect philosophy – you wouldn’t want eight rings, eight memory controllers, and multiple home agents on a single processor.
Understanding the status quo brings us to today's announcement. Enter the ‘mesh architecture’, which is designed and built with the upcoming Xeons in mind.
Rather than having an ever-growing number of rings to service a core count that will inevitably run past today's 22, the mesh architecture, according to Intel, is designed to be scalable and modular no matter what the eventual core count and memory bandwidth.
This super-simple diagram, above, shows that each core is connected to the others via a matrix of rows and columns, at crosspoints, with the total number of these entirely dependent upon the number of cores.
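To give a rough feel for why a grid of rows and columns scales better than a ring, here is a minimal Python sketch that counts average hops between stops. It rests on loud assumptions that are not Intel's figures: a single bidirectional ring (real Broadwell-EP parts use more than one ring), a hypothetical 4x7 grid for 28 stops, and shortest-path routing with every stop equally likely to talk to every other. The function names `ring_avg_hops` and `mesh_avg_hops` are made up for illustration.

```python
from itertools import product


def ring_avg_hops(n):
    """Average shortest-path hops between distinct stops on one bidirectional ring.

    Assumption: traffic always takes the shorter way round the ring.
    """
    total = sum(min(d, n - d) for d in range(1, n))
    return total / (n - 1)


def mesh_avg_hops(rows, cols):
    """Average hops between distinct crosspoints of a rows x cols mesh.

    Assumption: traffic moves along rows and columns only, so the hop
    count is the Manhattan distance between the two crosspoints.
    """
    nodes = list(product(range(rows), range(cols)))
    total = 0
    pairs = 0
    for a in nodes:
        for b in nodes:
            if a != b:
                total += abs(a[0] - b[0]) + abs(a[1] - b[1])
                pairs += 1
    return total / pairs


if __name__ == "__main__":
    # Hypothetical 28-stop comparison: one ring versus a 4x7 grid.
    print(f"ring, 28 stops: {ring_avg_hops(28):.2f} hops on average")
    print(f"mesh, 4 x 7:    {mesh_avg_hops(4, 7):.2f} hops on average")
```

Under these simplified assumptions the mesh needs roughly half as many hops on average as the single ring. The sketch counts hops only; it says nothing about the cycles each hop costs or about congestion.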
Not everything is laid out symmetrically, however. If a core requests information from the L3 cache connected to it vertically, then there is a one-cycle access latency – it is literally right next to it. However, if that core requires information from an LLC that is laid out horizontally, further to the left or right, then three cycles of latency are incurred in spanning adjacent caches. The point is that moving across the entire chip takes multiple hops, and the cost of each hop depends on its direction. Intel reckons that even with this limitation latency is kept as low as on a ringbus design.
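The per-hop figures quoted above can be turned into a back-of-the-envelope latency estimate. The Python sketch below is illustrative only: it assumes costs simply add up along the route, that each request travels along its column and row without detours, and that stops are addressed by hypothetical (row, column) coordinates. None of this comes from Intel documentation, and the name `mesh_latency_cycles` is invented for the example.

```python
# Per-hop costs taken from the figures quoted in the text.
VERTICAL_HOP_CYCLES = 1    # adjacent stop in the same column
HORIZONTAL_HOP_CYCLES = 3  # adjacent stop in the same row


def mesh_latency_cycles(src, dst):
    """Estimate cycles to travel from src to dst, each a (row, col) tuple.

    Assumption: the total is the number of vertical hops times the
    vertical cost plus the number of horizontal hops times the
    horizontal cost; queuing and congestion are ignored.
    """
    vertical_hops = abs(src[0] - dst[0])
    horizontal_hops = abs(src[1] - dst[1])
    return (vertical_hops * VERTICAL_HOP_CYCLES
            + horizontal_hops * HORIZONTAL_HOP_CYCLES)


if __name__ == "__main__":
    print(mesh_latency_cycles((0, 0), (1, 0)))  # one vertical hop
    print(mesh_latency_cycles((0, 0), (0, 1)))  # one horizontal hop
    print(mesh_latency_cycles((0, 0), (4, 5)))  # corner to corner, hypothetical 5x6 grid
```

The asymmetry shows up immediately in this toy model: a request that mostly travels along a column is much cheaper than one that has to cross several columns.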
Traffic flows in both directions on each row and column, and if the final destination isn't ready to receive the information, it simply keeps circling the mesh.
Interestingly, power consumption goes down compared to a ringbus architecture, because the mesh has far more intrinsic bandwidth available and therefore can be run at a lower voltage/speed whilst still maintaining the required latency. This ultimately means that some more of the chip’s TDP can be shifted to the cores instead, thereby increasing total compute power when compared directly with a ringbus design, according to Intel.
Now, too, the memory controllers sit in the north-west portion of the chip while IO runs north-south; they are not lumped together. The mesh architecture runs at uncore speed, which is in the region of 1.8GHz-2.4GHz.
Summing up what we’ve learned thus far, the mesh architecture has been designed from the ground up to ensure that upcoming, many-core Intel CPUs have enough intra-chip bandwidth and IO speed to remove the bottlenecks that would inevitably have arisen had the ringbus architecture continued on, say, a 28-core part.
The final takeaway is that the mesh is less important on the client Core side, of course, where chip bandwidth isn’t really an issue – it is built for scalable Xeons.