Michael Myrter, CS641 Project 1, 27 January 2011
We find a variety of existing and emerging multi-core architectures, each solving problems relating to performance, robustness, power consumption, or specialized software applications. Solutions range from the high level, such as unconventional instruction sets, to the low level, such as unusual connections between cores. This project explores a few of these solutions.
One performance bottleneck at the chip level is the interconnect design between multiple cores. One conventional interconnect is the mesh architecture. A typical design may not be fully connected, so each core must relay data for other cores. Data transfer between distant cores can increase latency and consume extra power. Further, when designs include a large number of cores, the wire paths between the cores can occupy an unacceptable amount of chip real estate. First, we look at three traditional topologies for a large number of cores. Next, we look at several unique solutions to the two problems above: reducing latency between cores and reducing the physical size of a core.
Sanchez, Michelogiannakis, and Kozyrakis present a fine overview of three topologies for large-scale multi-core processors. In our seminar, we look at the 2D mesh, the fat tree, and the 2D flattened butterfly.
The 2D mesh is a common interconnect that uses routers that are connected to other routers as well as a number of cores. Advantages include design simplicity and short links. Disadvantages include a potentially high number of hops.
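To make the hop-count disadvantage concrete, here is a minimal sketch in Python. It assumes routers sit at integer (x, y) grid positions and that packets use dimension-ordered routing (all of x first, then y). The function names are our own and are not taken from the overview paper.

```python
def mesh_hops(src, dst):
    """Router-to-router hops between two (x, y) positions in a 2D mesh.

    Assumes dimension-ordered routing, so the hop count is the
    Manhattan distance between the two routers.
    """
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy)


def worst_case_mesh_hops(width, height):
    """Hops between opposite corners of a width x height mesh."""
    return (width - 1) + (height - 1)
```

Under these assumptions, the worst case grows with the side length of the mesh: worst_case_mesh_hops(8, 8) is 14 for 64 routers.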
The fat tree uses a tree structure where the cores are located at leaves of a tree and internal nodes are routers. Data travels upward in the tree until a common ancestor is found between source and destination. The number of connections increases towards the root of the tree. Advantages include high bandwidth because of the increased number of connections as data moves towards the root. Disadvantages include the need for more complex routers, again because of the increased number of connections toward the root.
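The upward-then-downward routing can be sketched as follows. This is an illustration under our own assumptions: leaves are numbered 0 to N-1 from left to right, every router has the same number of children (the hypothetical parameter arity), and the count covers only links traversed, not bandwidth.

```python
def fat_tree_hops(src, dst, arity=2):
    """Links traversed between two leaf cores in a tree of routers.

    Both leaf indices are moved up one level at a time (integer
    division by the arity) until they meet at a common ancestor.
    The packet climbs that many levels and descends the same number.
    """
    levels = 0
    while src != dst:
        src //= arity
        dst //= arity
        levels += 1
    return 2 * levels
```

Neighboring leaves under the same router need only two links, while leaves in opposite halves of the tree must pass through the root, which is why the links near the root carry the most traffic.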
The flattened butterfly is a modified butterfly network that is essentially a mesh network with additional links. Advantages include a small number of hops. Disadvantages include complex routers and increased chip area due to the large number of links.
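The trade-off between hops and links can be sketched in a few lines. We assume here the usual description of a 2D flattened butterfly, in which each router has a direct link to every other router in its row and in its column. That detail is our assumption and is not stated above, and the function names are our own.

```python
def flattened_butterfly_hops(src, dst):
    """Hops between two (x, y) routers in a 2D flattened butterfly.

    Assumes a direct link to every router in the same row and column,
    so a packet needs at most one hop per dimension that differs.
    """
    (sx, sy), (dx, dy) = src, dst
    return int(sx != dx) + int(sy != dy)


def links_per_router(width, height):
    """Router-to-router links at each router under the same assumption."""
    return (width - 1) + (height - 1)
```

For an 8 x 8 arrangement, no pair of routers is more than 2 hops apart, but each router carries 14 links to other routers, which illustrates the router complexity and chip area cost.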
Network-on-chip (NoC) designs attempt to solve the challenges of connecting many cores by using hierarchical data paths. Closely spaced cores are connected through traditional wire paths, and cores that are widely spaced use specialized, high-speed connections. This "small-world" topology can include wireless connections for a hybrid interconnect system, in which long-range, fast, wireless connections are established between distant cores. Such designs can improve performance and power consumption and allow a massive number of cores on one chip. The disadvantage is that NoC introduces the complexities that we see in traditional communication networks, such as congestion. Further complexity is introduced when the network scheme must optimize power consumption. Ongoing research explores the use of existing networking models at the chip core level.
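The effect of a few long-range links can be illustrated with a small experiment. The sketch below is our own model, not one taken from the NoC literature: it builds an n x n mesh as a graph, measures the average number of hops between all pairs of nodes with breadth-first search, and lets us add hypothetical shortcut links that stand in for the wireless connections.

```python
from collections import deque
from itertools import product


def mesh_links(n):
    """Adjacency sets for an n x n mesh of nodes at (x, y) positions."""
    adj = {node: set() for node in product(range(n), repeat=2)}
    for x, y in adj:
        for neighbor in ((x + 1, y), (x, y + 1)):
            if neighbor in adj:
                adj[(x, y)].add(neighbor)
                adj[neighbor].add((x, y))
    return adj


def add_shortcut(adj, a, b):
    """Add a bidirectional long-range link between nodes a and b."""
    adj[a].add(b)
    adj[b].add(a)


def average_hops(adj):
    """Average shortest-path length over all ordered pairs of nodes."""
    total = pairs = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs
```

For example, calling add_shortcut on the two pairs of opposite corners of mesh_links(8) lowers the value returned by average_hops. The model counts hops only, so it says nothing about congestion or power, which the paragraph above identifies as the harder problems.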
Nanoparticles can organize themselves into complex structures, similar to the natural structures of proteins. Researchers are looking at the advantages and challenges of applying nanotechnology to multi-core interconnects. One proposal is that nano self-assembled interconnects provide advantages similar to those found in natural systems. Because the interconnections would be largely disordered, advantages include performance improvements due to non-local links and increased fault tolerance due to natural redundancy.
Tilera Corporation is leading the pack with its 64-core TILE-Gx processor and, in 2011, 100-core processors. These devices use a proprietary mesh network that is fully connected such that each core can communicate directly with every core. The architecture, called tile architecture, has no centralized bus between cores. What is the difference between Tilera's tile architecture and a traditional mesh architecture developed by academics and Intel? Since the scheme is proprietary, it is difficult to determine. When asked about similarities between tile and mesh architectures, Tilera's marketing director responded "What makes Tilera stand apart from research that has been done at UT Austin and Intel is that we are in commercial production with two generations of chips." In other words, the Tilera product is first in mass production. One clue to the tile architecture's success involves the cache design. Each core includes an L1 and an L2 cache. When a request misses in the local L2 cache, the core checks the L2 caches of all other cores. In this way, the tile network itself acts as an on-chip L3 cache.
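The lookup order described above can be sketched as follows. Because the Tilera scheme is proprietary, this is only a simplified illustration of the idea as stated here: the Tile class and load function are hypothetical, caches are modeled as unbounded dictionaries, and coherence and eviction are ignored.

```python
class Tile:
    """Hypothetical model of one core with private L1 and L2 caches."""

    def __init__(self, name):
        self.name = name
        self.l1 = {}
        self.l2 = {}


def load(tiles, requester, address, memory):
    """Return (value, where_found) for a load issued by one tile.

    Order: local L1, local L2, the L2 caches of the other tiles
    (acting together as an L3), and finally main memory.
    """
    if address in requester.l1:
        return requester.l1[address], "L1"
    if address in requester.l2:
        value, source = requester.l2[address], "L2"
    else:
        for other in tiles:
            if other is not requester and address in other.l2:
                value, source = other.l2[address], f"remote L2 ({other.name})"
                break
        else:
            # No tile holds the line, so go off chip.
            value, source = memory[address], "memory"
        requester.l2[address] = value
    requester.l1[address] = value
    return value, source
```

In this sketch, a line held only in another tile's L2 is found without going to memory, and a repeated load of the same address then hits in the requester's own L1.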
Some software applications have small instruction working sets, that is, they utilize a small number of instructions. For such applications running on multi-core processors, it has been proposed that traditional L1 instruction caches are unnecessarily large. In this situation, multiple L1 caches, one for each core, are inefficient with respect to silicon space. To maximize the ratio of performance to silicon space, and to make room for additional cores, one paper proposes the use of tiny instruction caches to replace the L1 cache at each core. Additionally, this design includes a traditional L2 cache that is shared by all or some cores. It has been shown in simulation that the use of 64-byte to 256-byte micro caches can yield a substantial increase in the ratio of performance to silicon space in certain situations, up to a 25% increase. This study was limited to single-threaded applications.
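To see why a tiny cache can suffice for a small instruction working set, consider the following sketch. It is our own toy model and not the simulator used in the paper: we assume a direct-mapped cache, 16-byte lines, 4-byte instructions, and a trace that consists of one loop body fetched repeatedly.

```python
def loop_trace(instructions, iterations, instruction_bytes=4):
    """Fetch addresses for a straight-line loop body run repeatedly."""
    body = [i * instruction_bytes for i in range(instructions)]
    return body * iterations


def hit_rate(trace, cache_bytes, line_bytes=16):
    """Fraction of fetches that hit in a direct-mapped instruction cache."""
    num_lines = cache_bytes // line_bytes
    stored = [None] * num_lines  # line number held in each cache slot
    hits = 0
    for address in trace:
        line = address // line_bytes
        slot = line % num_lines
        if stored[slot] == line:
            hits += 1
        else:
            stored[slot] = line  # miss: fetch the line and replace the slot
    return hits / len(trace)
```

Under these assumptions, a 12-instruction loop (48 bytes) fits in a 64-byte cache and misses only on its first pass. A 32-instruction loop (128 bytes) does not fit in 64 bytes and misses on every line in every iteration, but it fits comfortably in 256 bytes.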
Discussion of multi-core programming and design often assumes each core is identical. What if we need to create a compiler for a chip whose cores are not identical, that is, a compiler that produces one binary image for dissimilar cores? The PowerPC e200 series of cores is such an architecture. The Freescale PowerPC family allows a chip designer to implement or omit a variety of extensions to the PowerPC instruction set. The e200 series further complicates compiler development because some cores omit the base PowerPC instruction set entirely. For example, the Freescale PowerPC 5510 is a dual-core chip whose second core (e200z0) implements a subset of the instructions supported by the first core (e200z1). The master core (e200z1) executes either the base PowerPC instruction set or a new variable length encoding (VLE) instruction set. The VLE instructions mostly consist of a simplified implementation of the base PowerPC instructions, and some of the instructions are encoded in fewer bits. The result is that the binary image of a VLE application is smaller than that of the equivalent base PowerPC application. On the PowerPC 5510, the second core (e200z0) does not implement the base PowerPC instructions; only the VLE instructions are allowed. The advantage of this approach is that the secondary core uses less power and occupies less silicon real estate. Consequently, an application programmer can run a bootstrapping application on the primary core, configure the secondary core, and launch an application on the low-power secondary core. The primary core is then free to perform additional work or no work at all.
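The constraint this places on a compiler or linker can be sketched as a simple compatibility check. The table below encodes only what is stated above about the two cores of the PowerPC 5510. The names CORE_ENCODINGS and cores_that_can_run are hypothetical and do not come from any Freescale tool.

```python
# Instruction encodings each core can execute, as described in the text.
CORE_ENCODINGS = {
    "e200z1": {"base", "vle"},  # primary core: base PowerPC or VLE
    "e200z0": {"vle"},          # secondary core: VLE only
}


def cores_that_can_run(encoding):
    """Names of the cores able to execute code in the given encoding."""
    return sorted(
        core
        for core, supported in CORE_ENCODINGS.items()
        if encoding in supported
    )
```

A tool producing one binary image for both cores would need a check of this kind for every routine: code compiled as VLE may be placed on either core, while code compiled for the base instruction set may only run on the e200z1.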
We have examined a few architectural design techniques for multi-core chips. In each case, design strategies introduce both solutions and new challenges. Novel interconnects and cache designs permit a large number of cores, and heterogeneous cores allow processes to execute on the most appropriate core. However, for maximum performance, compiler authors and application programmers will need to understand the advantages and disadvantages of each chip design.