Saturday 24 January 2004

IA64's Global Pointer

[Update 2004-01-24 13:14 GMT: Raymond emailed me regarding the reason for saving gp across translation units. Suggestions gratefully incorporated.]

Raymond Chen has been posting even more material about IA64 in the last few days. There's a lot of confusion over the role of the global pointer (gp, or r1) register. Some people have been comparing it to the segment registers as used in 16-bit x86 code.

I've been doing a bit of DOS programming during the last week (the project I referred to in the last post is cross-platform, running on DOS, Palm, desktop Windows in a console, and Windows CE), and I can safely say that I don't need to care about segments, particularly; the compiler's memory model sorts that out. You had to care with Windows 3.x, because the segments were a lot more visible.

The problem is relocatable code: how does a program (or, in general, a module - could be a library) find its global data if the address it loads at could change? One solution is to have a table of addresses that need to be adjusted if the load address does change, which is what Windows on x86 does. The downside of this is a large table of relocations (SHDOCVW.DLL, the web browser control, has 40KB of them in a 1.2MB binary) on the disk, a delay in loading as the table is processed, and more memory used as a lot of pages are edited if the binary is loaded at different addresses in different processes.
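As a rough sketch of what a loader does with such a table: each entry names a slot in the image that holds an absolute address, and the loader adds the difference between the actual and preferred load addresses to every slot. The function name and the flat list of offsets are assumptions made for illustration; the real PE format groups its entries per page and is more involved.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified model (not the real PE layout): each relocation entry is the
   byte offset of a 32-bit absolute address stored somewhere in the image. */
void apply_relocations(uint8_t *image,
                       const uint32_t *reloc_offsets, size_t reloc_count,
                       uint32_t preferred_base, uint32_t actual_base)
{
    /* Unsigned arithmetic wraps, so this also works when loading lower. */
    uint32_t delta = actual_base - preferred_base;

    for (size_t i = 0; i < reloc_count; i++) {
        uint32_t address;
        /* memcpy keeps the access safe even if the slot is unaligned. */
        memcpy(&address, image + reloc_offsets[i], sizeof address);
        address += delta;
        /* Writing the slot back is what dirties the page. */
        memcpy(image + reloc_offsets[i], &address, sizeof address);
    }
}
```

Every slot that is patched touches a page, which is where the extra per-process memory comes from.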

RISC processors have another problem - they typically use fixed-length instruction words (not Very Long Instruction Words, which are a different thing). 'Uniform Instruction Word' might be a better name - all instructions are the same length - especially when referring to Hitachi SHx, ARM THUMB or MIPS16, where the instructions are only 16 bits long. These instructions can only include a small amount of immediate data (data specified as part of the instruction): for IA64, only 22 bits of immediate data can be used in the add-immediate-to-register instruction. To load a larger value, you must either build it up using several instructions (taking care over the dependencies between them), or load it from memory. But where do you find it?
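For the 'several instructions' route, here is a minimal sketch in the style of the MIPS lui/ori pair, which builds a 32-bit constant from two 16-bit immediates. The type and function names are made up for illustration, and this models the idea rather than any IA64 encoding.

```c
#include <stdint.h>

/* The two 16-bit immediates a compiler would embed in two instructions. */
typedef struct {
    uint16_t high;  /* carried by the 'load upper immediate' instruction */
    uint16_t low;   /* carried by the following 'or immediate' instruction */
} split_immediate;

/* Compile time: cut the constant into pieces that fit in an instruction. */
split_immediate split_constant(uint32_t value)
{
    split_immediate parts = { (uint16_t)(value >> 16),
                              (uint16_t)(value & 0xFFFFu) };
    return parts;
}

/* Run time: what the two instructions compute between them. */
uint32_t materialise_constant(split_immediate parts)
{
    uint32_t reg = (uint32_t)parts.high << 16;  /* lui: low half is zero */
    reg |= parts.low;                           /* ori: zero-extended */
    return reg;
}
```

The second instruction cannot start until the first has produced its result, which is the dependency cost of this approach.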

The solution used on most RISC processors is a so-called PC-relative pool - where the data is stored in a location relative to the current Program Counter (or Instruction Pointer if you prefer). The code then doesn't need to change if the program is relocated - the data will still be located near the function (after the final return instruction, for ARM).

However, this has pretty poor data locality. Most processors have split instruction/data Translation Lookaside Buffers and separate Level 1 cache memory for instructions and for data. The PC-relative pools tend to be quite small, and scattered throughout the program, which leads to poorer use of the separate caches and TLBs. We'd prefer it if we could keep the data together. Also, thinking of security, we can't protect the program as well as we could, because a page that mixes code and data has to be given execute permissions as a whole. If the data needs to be writeable, we have to give the whole page write permissions too. If we write to the page, the OS must copy it and can no longer share it between processes.

A possible solution is to only store pointers to the data in the PC-relative pool, but then we've hit the relocations problem again - the pointers must be rewritten to take account of a relocation.

On architectures with a lot of registers, we could decide to use - by convention - one of those registers to point to the base of the module's data area. MIPS has 32 general purpose registers, and on Windows NT, used register $28 as a 'global pointer' (Windows CE does not use global pointers). IA64 has 128 general registers (32 of which are fixed, the other 96 act as a stack mechanism) and so r1 is reserved as a global pointer. This allows us to have a lot of data with no relocations, and keep it all in a big block.
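A minimal sketch of the idea, with the gp register modelled as an ordinary pointer argument. The offsets and function names are hypothetical: the point is only that offsets are fixed at link time while gp is set at load time, so moving the data block changes gp and nothing else.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical link-time layout of the module's data area. */
enum { OFFSET_COUNTER = 0, OFFSET_LIMIT = 8 };

/* Models 'add a fixed offset to gp, then load 8 bytes from the result'. */
int64_t load_global(const uint8_t *gp, int32_t offset)
{
    int64_t value;
    memcpy(&value, gp + offset, sizeof value);
    return value;
}

/* Models 'add a fixed offset to gp, then store 8 bytes to the result'. */
void store_global(uint8_t *gp, int32_t offset, int64_t value)
{
    memcpy(gp + offset, &value, sizeof value);
}
```

If the data area is copied elsewhere, calling load_global with the new base and the same offsets returns the same values: no instruction had to be patched.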

There's a downside, though: whenever we call outside the current module, we have to store the current global pointer value, load the correct one for the called function, call the function and then restore gp when it returns. 'Module' here means the compilation unit; if you call a function in another source file, gp may need to be changed, because the compiler does not know whether the function might be imported from a DLL, which has its own global data area and so its own gp value. Aside: in the case of a C++ non-virtual method call, it does know that the called method is in the same library if the class is not imported using __declspec(dllimport), for Visual C++. You can avoid the save/reload operations using Whole Program Optimization/Link-Time Code Generation, because the compiler knows the location of all entities when it does so.

This also has a side-effect on function pointers. Since function pointers are generally used at some distance from allocation, they might be used in a module with a different gp value. The compiler gets around this by not compiling a function pointer to a single pointer-sized value; it compiles to a pair of pointer-sized values, one representing the address of the first instruction (bundle on IA64) in the function, the other being the correct gp value to use.
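The save/load/call/restore sequence and the two-word function pointer can be sketched as follows. This is a model under stated assumptions, not what the compiler literally emits: gp is an ordinary static variable standing in for the register, and module_data, function_descriptor, call_via_descriptor and add_to_counter are all invented names.

```c
#include <stdint.h>

/* One module's global data area, reduced to a single variable. */
typedef struct module_data { int64_t counter; } module_data;

/* Stand-in for the hardware gp register (r1). */
static module_data *gp;

/* A function pointer as a pair: code address plus the gp it expects. */
typedef struct {
    int64_t (*entry)(int64_t);  /* first instruction (bundle) of the function */
    module_data *gp;            /* data area of the module that owns it */
} function_descriptor;

int64_t call_via_descriptor(const function_descriptor *fd, int64_t arg)
{
    module_data *saved_gp = gp;        /* store the caller's gp */
    gp = fd->gp;                       /* load the callee's gp */
    int64_t result = fd->entry(arg);   /* call */
    gp = saved_gp;                     /* restore on return */
    return result;
}

/* Example callee: reaches its module's global through gp. */
int64_t add_to_counter(int64_t n)
{
    gp->counter += n;
    return gp->counter;
}
```

With a caller whose counter is 1 and a callee module whose counter is 10, calling add_to_counter(5) through a descriptor updates only the callee's counter, and gp points at the caller's data again afterwards.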

There's no limit on the size of this global data area. However, only a (relatively) small amount of it can be addressed in a single pair of instructions: the 22-bit immediate of the add instruction on IA64 limits the 'small' part of the static data area to 4MB. If any more is required, the compiler stores the address of the larger block in the small area, then compiles a double indirection. It also adds this address to the relocations table since it will need to be updated if the module is relocated. Again, it has to be quite conservative, so anything not declared static is likely to be put in the large area. Using static to indicate that a variable is only used in this module is good practice anyway (information-hiding), but is also important for performance. IA64 programs benefit quite a lot from link-time code generation, being able to put only the large items in the large data area.
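To make the 4MB figure and the double indirection concrete, here is a small sketch. It assumes the 22-bit immediate is signed, giving a range of -2MB to 2MB-1 around gp; the function names are hypothetical and gp is again modelled as a plain pointer.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Signed 22-bit range: 2^21 below gp to 2^21 - 1 above, 4MB in total. */
#define IMM22_MIN (-(INT64_C(1) << 21))
#define IMM22_MAX ((INT64_C(1) << 21) - 1)

/* Can this offset be reached with one add-immediate from gp? */
bool reachable_from_gp(int64_t offset)
{
    return offset >= IMM22_MIN && offset <= IMM22_MAX;
}

/* Large-area access: the small area only holds a pointer to the variable,
   so reading the variable takes two loads instead of one. */
int64_t load_indirect(const uint8_t *gp, int64_t slot_offset)
{
    const uint8_t *target;
    memcpy(&target, gp + slot_offset, sizeof target);  /* first load: address */

    int64_t value;
    memcpy(&value, target, sizeof value);              /* second load: data */
    return value;
}
```

The pointer stored in the slot is an absolute address, which is why that slot needs a relocation entry even though the code that reads it does not.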

An alternative strategy would have been to perform multiple adds, or shift-and-add, to gp to get a larger data area, but it doesn't get you much, and wastes a lot of instructions. The double-indirection is probably the right balance.

To be honest, few programs really have that much static data. More often, memory is allocated dynamically, with only the root pointer(s) of the data structure stored in a global variable, or on the stack. SHDOCVW.DLL only has 6.7KB of static data (it has over 540KB of resources). Word 2000's WINWORD.EXE (an 8MB executable) has 789KB of static data. If I had to have that much data, I'd store it as a binary resource or as a file and load it at runtime (which again would likely have better locality of reference).

It's important to note, therefore, that the use of the global pointer is an optimisation. It improves access to global data by removing the need for a large relocation table, improves locality of data references, and reduces loading overhead.

This is dramatically different from the DOS/Windows 3.1 case. Under DOS, the compiler usually takes care of the memory model, but you had to tell it which one to use. In 16-bit code, you only had two data segments to play with, and it was quite costly to change segments, requiring loading the right segment into the DS or ES register (having stored the old value on the stack) then, if using ES, using the segment override prefix, which added cycles to your code. You wanted to avoid the larger memory models if you could. Windows added another twist, that you often needed to know whether your data was near or far so you could make the right function call. Each segment was only 64KB in size, so you needed a lot of them for a complete program (the one I refer to above is a thin client, yet requires two code segments and two data segments for the whole program).
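For comparison, this is how a real-mode DOS segment:offset pair becomes a physical address (under Windows 3.x in protected mode the segment value is a selector instead, which this sketch does not model). The function name is made up for illustration.

```c
#include <stdint.h>

/* Real-mode 8086 translation: the segment counts 16-byte paragraphs,
   and the 16-bit offset can only reach 64KB beyond that point. */
uint32_t linear_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

For example, 0xB800:0x0000 maps to 0xB8000, and both 0x1234:0x0010 and 0x1235:0x0000 map to 0x12350. Reaching anything more than 64KB away means changing the segment register, whereas gp is a full 64-bit pointer.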

It's worth making the point that every register on IA64 can be used to point to any part of the whole 64-bit address space: the address space is flat. The gp register is just a convention and your compiler could ignore it, at least for internal calls. I believe the benefits outweigh the slight disadvantages.

1 comment:

Yuhong Bao said...

"In 16-bit code, you only had two data segments to play with"
Or 4 on a 386 or later.
"and it was quite costly to change segments"
Especially in protected mode, where protection checks have to be done.
"requiring loading the right segment into the DS or ES register"
Or FS or GS, if on a 386 or later.
"(having stored the old value on the stack) then, if using ES,"
FS or GS,
"using the segment override prefix, which added cycles to your code. "