Wednesday, December 29, 2021

RISC vs. CISC Is the Wrong Lens for Comparing Modern x86, ARM CPUs

Update (12/29/2021): It’s the end of the year, so we’re surfacing a few old favorites from earlier in 2021. The “M2” mentioned below refers to the Apple SoCs that eventually shipped as the M1 Pro and M1 Max.

Original story below:

With Apple’s WWDC coming up soon, we’re expecting to hear more about the company’s updated, ARM-based MacBook Pro laptops. Rumors point to Apple launching a slate of upgraded systems, this time based around its “M2” CPU, a scaled-up version of the M1 core that debuted last year. The M2 could reportedly field eight high-performance cores and two high-efficiency cores, up from a 4+4 configuration in the existing M1.

With the launch of the ARM-based M1 came a raft of x86-versus-ARM comparisons and online discussions contrasting the two architectures. In these threads, you’ll often see authors bring up two additional acronyms: CISC and RISC. The linkage between “ARM versus x86” and “CISC versus RISC” is so strong that every single story on the first page of Google results defines the first with reference to the second.

This association mistakenly suggests that “x86 versus ARM” can be classified neatly into “CISC versus RISC,” with x86 being CISC and ARM being RISC. Thirty years ago, this was true. It’s not true today. The battle over how to compare x86 CPUs to processors built by other companies isn’t a new one. It only feels new today because x86 hasn’t had a meaningful architectural rival for nearly two decades. ARM may prominently identify itself as a RISC CPU company, but today these terms conceal as much as they clarify regarding the modern state of x86 and ARM CPUs.

Image by David Bauer, CC BY-SA 2.0

A Simplified History of the Parts People Agree On

RISC is a term coined by David Patterson and David Ditzel in their seminal 1981 paper “The Case for the Reduced Instruction Set Computer.” The two men proposed a new approach to semiconductor design based on observed trends in the late 1970s and the scaling problems encountered by then-current CPUs. They offered the term “CISC” (Complex Instruction Set Computer) to describe many of the various CPU architectures already in existence that did not follow the tenets of RISC.

This perceived need for a new approach to CPU design came about as the bottlenecks limiting CPU performance changed. So-called CISC designs, including the original 8086, were designed to deal with the high cost of memory by moving complexity into hardware. They emphasized code density, and some instructions performed multiple operations in sequence on a variable. As a design philosophy, CISC attempted to improve performance by minimizing the number of instructions a CPU had to execute in order to perform a given task. CISC instruction set architectures typically offered a wide range of specialized instructions.

By the late 1970s, CISC CPUs had a number of drawbacks. They often had to be implemented across multiple chips, because the VLSI (Very Large Scale Integration) techniques of the time period couldn’t pack all the necessary components into a single package. Implementing complicated instruction set architectures, with support for a large number of rarely used instructions, consumed die space and lowered maximum achievable clock speeds. Meanwhile, the cost of memory was steadily decreasing, making an emphasis on code size less important.

Patterson and Ditzel argued that CISC CPUs were still attempting to solve code bloat problems that had never quite materialized. They proposed a fundamentally different approach to processor design. Realizing that the vast majority of CISC instructions went unused (think of this as an application of the Pareto principle, or 80/20 rule), the authors proposed a much smaller set of fixed-length instructions, all of which would complete in a single clock cycle. While this would result in a RISC CPU performing less work per instruction than its CISC counterpart, chip designers would compensate for this by simplifying their processors.

This simplification would allow transistor budgets to be spent on other features like additional registers. Contemplated future features in 1981 included “on-chip caches, larger and faster transistors, or even pipelining.” The goal for RISC CPUs was to execute as close to one IPC (instructions per clock cycle, a measure of CPU efficiency) as possible, as quickly as possible. Reallocate resources in this fashion, the authors argued, and the end result would outperform any comparable CISC design.
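One common way to reason about this trade-off is the so-called iron law of processor performance: execution time equals instruction count times cycles per instruction (CPI, the inverse of IPC) times cycle time. The sketch below plugs in purely hypothetical numbers, none of which come from Patterson and Ditzel’s paper or from any real CPU, to show how a design that executes more instructions can still finish sooner if each instruction takes fewer, shorter cycles.

```python
def execution_time_ns(instruction_count: int,
                      cycles_per_instruction: float,
                      cycle_time_ns: float) -> float:
    """Iron law of performance: time = instructions x CPI x cycle time."""
    return instruction_count * cycles_per_instruction * cycle_time_ns


# Hypothetical workload; every number here is invented for illustration.
# "CISC-style": fewer instructions, but several slow cycles per instruction.
cisc_time = execution_time_ns(1_000_000, cycles_per_instruction=4.0, cycle_time_ns=100.0)

# "RISC-style": 30% more instructions, CPI close to 1, somewhat faster clock.
risc_time = execution_time_ns(1_300_000, cycles_per_instruction=1.2, cycle_time_ns=80.0)

speedup = cisc_time / risc_time
print(f"CISC-style: {cisc_time / 1e6:.1f} ms")
print(f"RISC-style: {risc_time / 1e6:.1f} ms")
print(f"Speedup:    {speedup:.2f}x")
```

Changing any one of the three factors shifts the result, which is why instruction count alone says little about which design is faster.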

It didn’t take long for these design principles to prove their worth. The R2000, introduced by MIPS in 1985, was capable of sustaining an IPC close to 1 in certain circumstances. Early RISC CPU families like SPARC and HP’s PA-RISC family also set performance records. During the late 1980s and early 1990s, it was common to hear people say that CISC-based architectures like x86 were the past, and perhaps good enough for home computing, but if you wanted to work with a real CPU, you bought a RISC chip. Data centers, workstations, and HPC are where RISC CPUs were most successful, as illustrated below:

This Intel image is useful but needs a bit of context. “Intel Architecture” appears to refer only to x86 CPUs — not chips like the 8080, which was popular in the early computer market. Similarly, Intel had a number of supercomputers in the “RISC” category in 2000 — it was x86 machines that gained market share, specifically.

Consider what this image says about the state of the CPU market in 1990. By 1990, x86 had confined non-x86 CPUs to just 20 percent of the personal computer market, but x86 had virtually no share in data centers and none in HPC. When Apple wanted to bet on a next-generation CPU design, it chose to bet on PowerPC in 1991 because it believed high-performance CPUs built along RISC principles were the future of computing.

Agreement on the mutual history of CISC versus RISC stops in the early 1990s. The fact that Intel’s x86 architecture went on to dominate the computing industry across PCs, data centers, and high-performance computing (HPC) is undisputed. What’s disputed is whether Intel and AMD accomplished this by adopting certain principles of RISC design or if their claims to have done so were lies.

Divergent Views

One reason terms like RISC and CISC are poorly understood is a long-standing disagreement regarding the meaning and nature of certain CPU developments. A pair of quotes will illustrate the problem:

First, here’s Paul DeMone from RealWorldTech, in “RISC vs. CISC Still Matters:”

The campaign to obfuscate the clear distinction between RISC and CISC moved into high gear with the advent of the modern x86 processor implementations employing fixed length control words to operate out-of-order execution data paths… The “RISC and CISC are converging” viewpoint is a fundamentally flawed concept that goes back to the i486 launch in 1992 and is rooted in the widespread ignorance of the difference between instruction set architectures and details of physical processor implementation.

In contrast, here’s Jon “Hannibal” Stokes in “RISC vs. CISC: the Post-RISC Era:”

By now, it should be apparent that the acronyms “RISC” and “CISC” belie the fact that both design philosophies deal with much more than just the simplicity or complexity of an instruction set… In light of what we now know about the historical development of RISC and CISC, and the problems that each approach tried to solve, it should now be apparent that both terms are equally nonsensical… Whatever “RISC vs. CISC” debate that once went on has long been over, and what must now follow is a more nuanced and far more interesting discussion that takes each platform–hardware and software, ISA and implementation–on its own merits.

Neither of these articles is new. Stokes’ article was written in 1999, DeMone’s in 2000. I’ve quoted from them both to demonstrate that the question of whether the RISC versus CISC distinction is relevant to modern computing is literally more than 20 years old. Jon Stokes is a former co-worker of mine and more than expert enough to not fall into the “ignorance” trap DeMone references.

Implementation vs. ISA

The two quotes above capture two different views of what it means to talk about “CISC versus RISC.” DeMone’s view is broadly similar to ARM’s or Apple’s view today. Call this the ISA-centric position.

Stokes’ viewpoint is what has generally dominated thinking in the PC press for the past few decades. We’ll call this the implementation-centric position. I’m using the word “implementation” because it can contextually refer to either a CPU’s microarchitecture or the process node used to manufacture the physical chip. Both of these elements are relevant to our discussion. The two positions are described as “centric” because there’s overlap between them. Both authors acknowledge and agree on many trends, even if they reach different conclusions.

According to the ISA-centric position, there are certain innate characteristics of RISC instruction sets that make these architectures more efficient than their x86 cousins, including the use of fixed-length instructions and a load/store design. While some of the original differences between CISC and RISC are no longer meaningful, the ISA-centric view believes the remaining differences are still determinative, as far as performance and power efficiency between x86 and ARM are concerned, provided an apples-to-apples comparison.

This ISA-centric perspective holds that Intel, AMD, and x86 won out over MIPS, SPARC, and POWER/PowerPC for three reasons: Intel’s superior process manufacturing, the gradual reduction in the so-called “CISC tax” that this manufacturing lead enabled over time, and binary compatibility, which made x86 more valuable as its install base grew whether or not it was the best ISA.

The implementation-centric viewpoint looks to the ways modern CPUs have evolved since terms like RISC and CISC were invented and argues that we’re working with an utterly outdated pair of categories.

Here’s an example. Today, both x86 and high-end ARM CPUs use out-of-order execution to improve CPU performance. Using silicon to re-order instructions on the fly for better execution efficiency is entirely at odds with the original design philosophy of RISC. Patterson and Ditzel advocated for a less complicated CPU capable of running at higher clock speeds. Other common features of modern ARM CPUs, like SIMD execution units and branch prediction, also didn’t exist in 1981. The original goal of RISC was for all instructions to execute in a single cycle, and most ARM instructions conform to this rule, but the ARMv8 and ARMv9 ISAs contain instructions that take more than one clock cycle to execute. So do modern x86 CPUs.

The implementation-centric view argues that a combination of process node improvements and microarchitectural enhancements allowed x86 to close the gap with RISC CPUs long ago and that ISA-level differences are irrelevant above very low power envelopes. This is the point of view backed by a 2014 study on ISA efficiency that I have written about in the past. It’s a point of view generally backed by Intel and AMD, and it’s one I’ve argued for.

But is it wrong?

Did RISC and CISC Development Converge?

The implementation-centric view is that CISC and RISC CPUs have evolved towards each other for decades, beginning with the adoption of new “RISC-like” decoding methods for x86 CPUs in the mid-1990s.

The common explanation goes like this: In the early 1990s, Intel and other x86 CPU manufacturers realized that improving CPU performance in the future would require more than larger caches or faster clocks. Multiple companies decided to invest in x86 CPU microarchitectures that would reorder their own instruction streams on the fly to improve performance. As part of that process, native x86 instructions were fed into an x86 decoder and translated to “RISC-like” micro-ops before being executed.
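As a rough illustration of that translation step, the toy decoder below splits an x86-style `add` with a memory operand into simpler steps that each touch memory at most once. The micro-op names, the `tmp0` temporary, and the bracketed operand syntax are all hypothetical; real micro-op formats are proprietary, much wider, and may fuse or split operations differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    kind: str        # "load", "add", or "store" in this toy model
    dest: str        # register, temporary, or memory operand written
    sources: tuple   # operands read


def is_memory_operand(operand: str) -> bool:
    """Treat anything written as [..] as a memory reference."""
    return operand.startswith("[") and operand.endswith("]")


def decode_add(dest: str, src: str) -> list:
    """Split 'add dest, src' into steps with at most one memory access each."""
    if is_memory_operand(dest) and is_memory_operand(src):
        # x86 does not allow two memory operands on a single add.
        raise ValueError("add cannot take two memory operands")
    if is_memory_operand(src):
        # add reg, [mem]: load the operand, then do a register-only add.
        return [
            MicroOp("load", "tmp0", (src,)),
            MicroOp("add", dest, (dest, "tmp0")),
        ]
    if is_memory_operand(dest):
        # add [mem], reg: read, modify, and write back.
        return [
            MicroOp("load", "tmp0", (dest,)),
            MicroOp("add", "tmp0", ("tmp0", src)),
            MicroOp("store", dest, ("tmp0",)),
        ]
    # add reg, reg: already a simple register-to-register operation.
    return [MicroOp("add", dest, (dest, src))]


for uop in decode_add("[rbx]", "rax"):
    print(uop.kind, uop.dest, uop.sources)
```

Whether steps like these deserve to be called “RISC-like” is exactly the point the engineers quoted in this section disagree on.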

This has been the conventional wisdom for over two decades now, but it’s been challenged again recently. In a story posted to Medium back in 2020, Erik Engheim wrote: “There are no RISC internals in x86 chips. That is just a marketing ploy.” He points to both DeMone’s story and a quote by Bob Colwell, the chief architect behind the P6 microarchitecture.

The P6 microarchitecture was the first Intel microarchitecture to implement out-of-order execution and a native x86-to-micro-op decode engine. P6 was shipped as the Pentium Pro and it evolved into the Pentium II, Pentium III, and beyond. It’s the grandfather of modern x86 CPUs. If anyone ought to know the answer to this question, it would be Colwell, so here’s what he had to say:

Intel’s x86’s do NOT have a RISC engine “under the hood.” They implement the x86 instruction set architecture via a decode/execution scheme relying on mapping the x86 instructions into machine operations, or sequences of machine operations for complex instructions, and those operations then find their way through the microarchitecture, obeying various rules about data dependencies and ultimately time-sequencing.

The “micro-ops” that perform this feat are over 100 bits wide, carry all sorts of odd information, cannot be directly generated by a compiler, are not necessarily single cycle. But most of all, they are a microarchitecture artifice — RISC/CISC is about the instruction set architecture… The micro-op idea was not “RISC-inspired”, “RISC-like”, or related to RISC at all. It was our design team finding a way to break the complexity of a very elaborate instruction set away from the microarchitecture opportunities and constraints present in a competitive microprocessor.

Case closed! Right?

Not exactly.

Intel wasn’t the first x86 CPU manufacturer to combine an x86 front-end decoder with what was claimed to be a “RISC-style” back-end. NexGen, later acquired by AMD, was. The NexGen Nx586 CPU debuted in March 1994, while the Pentium Pro wouldn’t launch until November 1995. Here’s how NexGen described its CPU: “The Nx586 processor is the first implementation of NexGen’s innovative and patented RISC86 microarchitecture.” (Emphasis added). Later, the company gives some additional detail: “The innovative RISC86 approach dynamically translates x86 instructions into RISC86 instructions. As shown in the figure below, the Nx586 takes advantage of RISC performance principles. Due to the RISC86 environment, each execution unit is smaller and more compact.”

It could still be argued that this is marketing speak and nothing more, so let’s step ahead to 1996 and the AMD K5. The K5 is typically described as an x86 front-end married to an execution backend AMD borrowed from its 32-bit RISC micro-controller, the Am29000. Before we check out its block diagram, I want to compare it against the original Intel Pentium. The Pentium is arguably the pinnacle of CISC x86 evolution, given that it implements both pipelining and superscalar execution in an x86 CPU, but does not translate x86 instructions into micro-ops and lacks an out-of-order execution engine.


Now, compare the Pentium against the AMD K5.

If you’ve spent any time looking at microprocessor block diagrams, the K5 should look familiar in a way that the Pentium doesn’t. AMD bought NexGen after the launch of the Nx586. The K5 was a homegrown AMD design, but K6 was originally a NexGen product. From this point forward, CPUs start looking more like the chips we’re familiar with today. And according to the engineers that designed these chips, the similarities ran more than skin deep.

David Christie of AMD published an article in IEEE Micro on the K5 back in 1996 that speaks to how it hybridized RISC and CISC:

We developed a micro-ISA based loosely on the 29000’s instruction set. Several additional control fields expanded the microinstruction size to 59 bits. Some of these simplify and speed up the superscalar control logic. Others provide x86-specific functionality that is too performance critical to synthesize with sequences of micro instructions. But these micro instructions still adhere to basic RISC principles: simple register-to-register operations with fixed-position encoding of register specifiers and other fields, and no more than one memory reference per operation. For this reason we call them RISC operations, or ROPs for short (pronounced R-ops). Their simple, general-purpose nature gives us a great deal of flexibility in implementing the more complex x86 operations, helping to keep the execution logic relatively simple.

The most important aspect of the RISC microarchitecture, however, is that the complexity of the x86 instruction set stops at the decoder and is largely transparent to the out-of-order execution core. This approach requires very little extra control complexity beyond that needed for speculative out-of-order RISC execution to achieve speculative out-of-order x86 execution. The ROP sequence for a task switch looks no more complicated than that for a string of simple instructions. The complexity of the execution core is effectively isolated from the complexity of the architecture, rather than compounded by it.

Christie is not confusing the difference between an ISA and the details of a CPU’s physical implementation. He’s arguing that the physical implementation is itself “RISC-like” in significant and important ways.

The K5 re-used parts of the execution back-end AMD developed for its Am29000 family of RISC CPUs, and it implements an internal instruction set that is more RISC-like than the native x86 ISA. The RISC-style techniques NexGen and AMD refer to during this period reference concepts like data caches, pipelining, and superscalar architectures. Two of these, caches and pipelining, are named in Patterson and Ditzel’s paper. None of these ideas are strictly RISC, but they all debuted in RISC CPUs, and they were advantages associated with RISC CPUs when K5 was new. Marketing these capabilities as “RISC-like” made sense for the same reason it made sense for OEMs of the era to describe their PCs as “IBM-compatible.”

The degree to which these features are RISC and the answer to whether x86 CPUs decode RISC-style instructions depend on the criteria you choose to frame the question. The argument is larger than the Pentium Pro, even if P6 is the microarchitecture most associated with the evolution of techniques like an out-of-order execution engine. Different engineers at different companies had their own viewpoints.

How Encumbered Are x86 CPUs in the Modern Era?

The past is never dead. It’s not even past. - William Faulkner

It’s time to pull this discussion into the modern era and consider what the implications of this “RISC versus CISC” comparison are for the ARM and x86 CPUs actually shipping today. The question we’re really asking when we compare AMD and Intel CPUs with Apple’s M1 and future M2 is whether there are historical x86 bottlenecks that will prevent x86 from competing effectively with Apple and future ARM chips from companies such as Qualcomm.

According to AMD and Intel: No. According to ARM: Yes. Since all of the companies in question have obvious conflicts of interest, I asked Agner Fog instead.

Agner Fog is a Danish evolutionary anthropologist and computer scientist, known for the extensive resources he maintains on the x86 architecture. His microarchitectural manuals are practically required reading if you want to understand the low-level behavior of various Intel and AMD CPUs:

ISA is not irrelevant. The x86 ISA is very complicated due to a long history of small incremental changes and patches to add more features to an ISA that really had no room for such new features…

The complicated x86 ISA makes decoding a bottleneck. An x86 instruction can have any length from 1 to 15 bytes, and it is quite complicated to calculate the length. And you need to know the length of one instruction before you can begin to decode the next one. This is certainly a problem if you want to decode 4 or 6 instructions per clock cycle! Both Intel and AMD now keep adding bigger micro-op caches to overcome this bottleneck. ARM has fixed-size instructions so this bottleneck doesn’t exist and there is no need for a micro-op cache.

Another problem with x86 is that it needs a long pipeline to deal with the complexity. The branch misprediction penalty is equal to the length of the pipeline. So they are adding ever-more complicated branch prediction mechanisms with large branch history tables and branch target buffers. All this, of course, requires more silicon space and more power consumption.

The x86 ISA is quite successful despite of these burdens. This is because it can do more work per instruction. For example, A RISC ISA with 32-bit instructions cannot load a memory operand in one instruction if it needs 32 bits just for the memory address.
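The decoding point in the quote above can be sketched with a toy model. The byte stream below uses a made-up encoding in which the first byte of each instruction states its total length. This is not how real x86 length decoding works (that involves prefixes, opcode maps, ModRM bytes, and more), but it preserves the property that matters here: each instruction boundary depends on the one before it. The function names and the 4-byte fixed width are assumptions chosen for illustration.

```python
def variable_length_boundaries(code: bytes) -> list:
    """Toy encoding: the first byte of each instruction is its length (1..15)."""
    offsets = []
    pos = 0
    while pos < len(code):
        length = code[pos]
        if not 1 <= length <= 15 or pos + length > len(code):
            raise ValueError(f"bad instruction length at offset {pos}")
        offsets.append(pos)
        # The next start offset is unknown until this length has been read,
        # so the walk is inherently sequential.
        pos += length
    return offsets


def fixed_length_boundaries(code: bytes, width: int = 4) -> list:
    """Fixed-width encoding: every start offset is known without reading a byte."""
    if len(code) % width != 0:
        raise ValueError("stream is not a whole number of instructions")
    return list(range(0, len(code), width))


# Four toy instructions of lengths 3, 1, 5, and 2 bytes.
stream = bytes([3, 0, 0, 1, 5, 0, 0, 0, 0, 2, 0])
print(variable_length_boundaries(stream))   # [0, 3, 4, 9]
print(fixed_length_boundaries(bytes(16)))   # [0, 4, 8, 12]
```

In the fixed-width case, several decoders can each be pointed at a known offset in the same cycle. In the variable-length case, hardware has to find or predict the boundaries first, which is the work a micro-op cache helps avoid repeating.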

In his microarchitectural manual, Agner also writes that more recent trends in AMD and Intel CPU designs have hearkened back to CISC principles to make better use of limited code caches, increase pipeline bandwidth, and reduce power consumption by keeping fewer micro-ops in the pipeline. These improvements represent microarchitectural offsets that have improved overall x86 performance and power efficiency.

And here, at last, we arrive at the heart of the question: Just how heavy a penalty do modern AMD and Intel CPUs pay for x86 compatibility?

The decode bottleneck, branch prediction, and pipeline complexities that Agner refers to above are part of the “CISC tax” that ARM argues x86 incurs. In the past, Intel and AMD have told us decode power is a single-digit percentage of total chip power consumption. But that doesn’t mean much if a CPU is burning power for a micro-op cache or complex branch predictor to compensate for the lack of decode bandwidth. Micro-op cache power consumption and branch prediction power consumption are both determined by the CPU’s microarchitecture and its manufacturing process node. “RISC versus CISC” does not adequately capture the complexity of the relationship between these three variables: ISA, microarchitecture, and process node.

It’s going to take a few years before we know if Apple’s M1 and future CPUs from Qualcomm represent a sea change in the market or the next challenge AMD and Intel will rise to. Whether maintaining x86 compatibility is a burden for modern CPUs is both a new question and a very old one. New, because until the M1 launched, there was no meaningful comparison to be made. Old, because this topic used to get quite a bit of discussion back when there were non-x86 CPUs still being used in personal computers.

AMD continues to improve Zen by 1.15x – 1.2x per year. We know Intel’s Alder Lake will also use low-power x86 CPU cores to improve idle power consumption. Both x86 manufacturers continue to evolve their approaches to performance. It will take time to see how these cores, and their successors, map against future Apple products — but x86 is not out of this fight.

Why RISC vs. CISC Is the Wrong Way to Compare x86, ARM CPUs

When Patterson and Ditzel coined RISC and CISC they intended to clarify two different strategies for CPU design. Forty years on, the terms obscure as much as they clarify. RISC and CISC are not meaningless, but the meaning and applicability of both terms have become highly contextual.

Boiling the entire history of CPU development down to CISC versus RISC is like claiming these two books contain the sum of all human knowledge. Only VLIW kids will get this post.

The problem with using RISC versus CISC as a lens for comparing modern x86 versus ARM CPUs is that it takes three specific attributes that matter to the x86 versus ARM comparison (process node, microarchitecture, and ISA), crushes them down to one, and then declares ARM superior on the basis of ISA alone. “ISA-centric” versus “implementation-centric” is a better way of understanding the topic, provided one remembers that there’s a Venn diagram of agreed-upon important factors between the two. Specifically:

The ISA-centric argument acknowledges that manufacturing geometry and microarchitecture are important and were historically responsible for x86’s dominance of the PC, server, and HPC market. This view holds that when the advantages of manufacturing prowess and install base are controlled for or nullified, RISC — and by extension, ARM CPUs — will typically prove superior to x86 CPUs.

The implementation-centric argument acknowledges that ISA can and does matter, but that historically, microarchitecture and process geometry have mattered more. Intel is still recovering from some of the worst delays in the company’s history. AMD is still working to improve Ryzen, especially in mobile. Historically, both x86 manufacturers have demonstrated an ability to compete effectively against RISC CPU manufacturers.

Given the reality of CPU design cycles, it’s going to be a few years before we really have an answer as to which argument is superior. One difference between the semiconductor market of today and the market of 20 years ago is that TSMC is a much stronger foundry competitor than most of the RISC manufacturers Intel faced in the late 1990s and early 2000s. Intel’s 7nm team has got to be under tremendous pressure to deliver on that node.

Nothing in this story should be read to imply that an ARM CPU can’t be faster and more efficient than an x86 CPU. The M1 and the CPUs that will follow from Apple and Qualcomm represent the most potent competitive threat x86 has faced in the past 20 years. The ISA-centric viewpoint could prove true. But RISC versus CISC is a starting point for understanding the historical difference between two different types of CPU families, not the final word on how they compare today.

This argument is clearly not going away. Fights that kicked off when Cheers was the hottest thing on television tend to have a lot of staying power. But understanding its history hopefully helps explain why it’s a flawed lens for comparing CPUs in the modern era.

Note: I disagree with Engheim on the idea that the various RISC-like claims made by x86 manufacturers constitute a marketing ploy, but he’s written some excellent stories on various aspects of programming and CPU design. I recommend his work for more details on these topics.

Feature image by Intel.

Source: ExtremeTech https://ift.tt/3fChG2Q
