Sunday, 12 June 2005

Mac & PC - will the performance comparisons end?

From Paul Thurrott: a link to Java Rants, “Will partnering with Intel give Apple a Mac faster than a PC?”


Let me expand. The theory goes that the processor is weighed down by years of backwards compatibility that costs performance – so starting afresh with a specially customised ‘x86’ with a smaller instruction set, free of that backward compatibility, could improve performance. Apple’s ‘special’ x86s could then beat ‘regular’ x86s.

The trouble with this argument is that it’s bullshit.

The Pentium 4 does feature compatibility right back to the 8086. If the platform support is right, you can, it is believed, boot the original MS-DOS 1.0. But that doesn’t slow it down.

It’s correct that the core of the P4 is RISC-like, but only in the sense that the original aim of RISC was to have very simple instructions that could be decoded by simple logic circuits and executed on the core in one cycle. Modern ‘RISC’ processors are effectively the same as the core of the P4 – there are multiple execution units that can execute operations concurrently, in the same cycle. On the P4 there are two Arithmetic-Logic Units [ALUs], each of which can actually perform two operations per cycle – one in the first half of the cycle, the other in the second. The key is in how the instructions are decoded.

Simple x86 instructions are decoded much as a true RISC processor would decode its own – using logic to translate an x86 instruction directly into core-compatible micro-operations [µops]. The resulting trace goes into the trace cache – the P4 attempts to decode a stream of instructions only once, executing loops directly from the translated µops. Anything more difficult than this – say, an indexed indirect memory load with autoincrement – goes to a microcode ROM, which contains a ‘program’ of µops implementing that x86 instruction. If the microcode program is longer than four µops, the execution core has to execute directly from the microcode ROM rather than the trace cache. I think that pretty much all of the clever out-of-order stuff is suppressed when this occurs: hit a large instruction in ROM and the processor slows to a crawl.
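The decode split just described can be sketched as a toy model. To be clear, the instruction names, µop sequences and table contents below are illustrative stand-ins of mine, not real P4 decode data – the point is only the shape of the mechanism: simple instructions decode straight into trace-cache µops, complex ones fall back to a stored µop program.

```python
# Toy model of the decode split described above: simple x86
# instructions decode directly into µops that land in the trace cache,
# while anything complex becomes a µop 'program' in the microcode ROM.
# Instruction names and µop counts here are made up for illustration.

MICROCODE_ROM = {
    # complex instruction -> its µop program (longer than four µops)
    "rep movsb": ["load", "store", "inc_src", "inc_dst", "dec_cnt", "branch"],
}

SIMPLE_DECODE = {
    # simple instruction -> directly decoded µops
    "add eax, ebx": ["alu_add"],
    "mov eax, [ebp+8]": ["load"],
}

def decode(instruction):
    """Return (where the µops execute from, the µops) for one instruction."""
    if instruction in SIMPLE_DECODE:
        # goes into the trace cache; loops re-execute from here
        return ("trace-cache", SIMPLE_DECODE[instruction])
    # anything harder is a stored µop program; executing from the
    # microcode ROM bypasses the trace cache and is slow
    return ("microcode-rom", MICROCODE_ROM[instruction])

print(decode("add eax, ebx"))   # ('trace-cache', ['alu_add'])
print(decode("rep movsb")[0])   # microcode-rom
```

The asymmetry is the whole point: the second path exists, but you only pay for it when you take it.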

Anyway, after that brief technical interlude, let me explain why removing backwards compatibility won’t speed up the processor: most of it is implemented in the microcode ROM. I don’t know if you recall when the P4 first came out. A lot of commentators, running their tests on Windows 98, suggested that the P4 was actually slower clock-for-clock than the PIII. Guess why? Partly because code was optimised for the PIII, which had different characteristics from the P4 – but also because Windows 98 still contained a lot of 16-bit code, which hit the microcode ROM.

If you don’t use the expensive instructions, you incur essentially no cost for their being there. Presumably there’s a little extra complexity in the decoder logic to determine that an instruction lives in microcode, but that’s it.

People often make the same mistake with respect to Windows XP. They think that the existence of the DOS- and Win16-compatible subsystems slows it down. Nope. That code lives in NTVDM.EXE and WOWEXEC.EXE, which aren’t even loaded unless they’re being used.

I also need to be clear that the console window is not DOS. cmd.exe, the Command Prompt, is a 32-bit Windows application that uses the console subsystem. The difference between a console app and a Windows app is that a console app has a console created for it if its parent was not running in one. What is a console? It’s a window which is created and updated by the CSRSS process. There’s some code in there to handle backwards-compatible graphics, but again this almost certainly isn’t loaded unless it’s being used. A lot of this has been ditched for Windows x64, but that’s because the processor doesn’t support virtual-8086 mode, necessary for this support, in long mode (64-bit mode).

cmd.exe supports (I believe) all of the DOS command interpreter’s feature set, and extends it greatly. This, plus the command-line environment, leads a lot of people to be confused – particularly when some of the system utilities, like more, exist as files named more.com. If you look with dumpbin or depends, you’ll see that more.com is in fact a Win32 console executable.
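What dumpbin is reporting there is the Subsystem field of the PE optional header. Here’s a minimal sketch of the same check in Python – the header offsets come from the Microsoft PE/COFF format; the function is my own illustration, not a dumpbin replacement:

```python
import struct

# Minimal PE header walk to find the Subsystem field that tools like
# dumpbin report. Offsets are from the Microsoft PE/COFF specification.
IMAGE_SUBSYSTEM = {2: "Windows GUI", 3: "Windows console (CUI)"}

def pe_subsystem(data):
    """Return the subsystem name of a PE image, given its raw bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not a DOS/PE executable")
    # e_lfanew at offset 0x3C points at the 'PE\0\0' signature
    (pe_off,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # the optional header follows the 4-byte signature and the
    # 20-byte COFF file header; Subsystem is a 16-bit field at
    # offset 68 within the optional header (same place for PE32/PE32+)
    opt_off = pe_off + 4 + 20
    (subsystem,) = struct.unpack_from("<H", data, opt_off + 68)
    return IMAGE_SUBSYSTEM.get(subsystem, f"unknown ({subsystem})")
```

Point it at more.com on a Windows box – `pe_subsystem(open("more.com", "rb").read())` – and it should report the console subsystem, matching what dumpbin shows.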

Speaking of x64, I note that Apple aren’t going straight for x64. Their Universal Binary Programming Guidelines [PDF, 1.5MB] talk about the 32-bit register set, not the 64-bit extended set. I expect that this decision is because the Pentium M with EM64T – codename ‘Merom’ – isn’t due out until late 2006.

This whole argument presupposes that Intel will make Mac-specific processors. I think that’s highly unlikely. Apple haven’t told us their motivations, leading to a lot of speculation. My view is that they are trying to both save money, by using a more commodity processor, and enable quicker access to newer technologies, by using more commodity chipsets (oh, and save money). It took them years to get AGP in Mac hardware, and it looked like taking even more years to get PCI Express. I don’t therefore expect Intel to produce ‘special’ processors for Apple – it’ll be stock parts or nothing.

Another reason for using stock parts is so that stock software tools can be used. Apple can mostly wave goodbye to maintaining their own compiler. It’s no accident that their guidelines state they’re using the System V i386 Application Binary Interface [PDF, 1.0MB], with only minor modifications. Of course, that means that to begin with, Apple will be using the very compiler – GCC – which they used to ‘prove’ that PowerPC G5s outperformed P4s, in that contentious set of benchmarks. Unless they’ve improved the optimiser, or learned how to use it properly, ‘PCs’ may well still outperform Macs.

One tiny area in which Macs may have a temporary advantage is exception handling. Today, on 32-bit Windows, each use of an exception handler pushes a handler frame onto the stack – a cost incurred whether or not an exception actually occurs. I believe GCC uses a table-based scheme, which in general incurs a cost only when an exception occurs. However, Windows on x64 – indeed on every architecture other than 32-bit x86 – also uses a table-based scheme.


asm86 said...

Very good article, if you don't mind I'd like to link to it from my x86 architecture blog. Thanks!

Vareck said...

A trace isn't a decoded µop, though traces are made up of µops in the Intel implementation. It is a sequence of instructions, independent of location in instruction memory, with branches collapsed. A mispredicted branch invalidates the trace. Traces generally have a single entry point and a single exit point, and are fetched all at once, rather than the traditional instruction at a time or line at a time.
