Friday 30 January 2004

Testing BlogJet

Test post with BlogJet. I might use this in future if it turns out OK.

Monday 26 January 2004

Microsoft Watch: Microsoft Antitrust Compliance Assessed

I've already said in a number of blog comments that I'm surprised at the number of companies who have licensed Microsoft's communications protocols.

Surprised that the number is so high, that is. Most of the protocols are only useful if you're producing systems management software intended to replace the mechanisms already in Windows. Alternatively, you can license some of the extended authentication protocols used by Windows in dial-up networking and network file systems, or the details of remote procedure call and the DCOM protocol. You get the idea.

Very little of this is of any use to the typical ISV. We are, for the most part, interested only in using these services, not reimplementing them.

The prevailing attitude among systems companies such as IBM is to use Open Source software, such as Samba, to interoperate with Windows platforms. Open Source is incompatible with the licence terms of the Communications Protocol Program - licensees are only allowed to disclose source code on their own premises, subject to confidentiality clauses. Otherwise, systems companies are only interested in making Microsoft follow their standards, not vice versa. No other operating system has ever implemented COM, disregarding a Microsoft-sponsored port to Solaris.

Saturday 24 January 2004

Site news

  • Blogger have recently added support for the ATOM subscription format, and it's free. Subscribe. You may need to upgrade your blog viewer to view this, because ATOM is very new. NewsGator 2.0 supports it.

  • Name change. In case people follow links I post on other blogs, I've made the identity a little clearer.

Last post on IA64, I promise

At least for tonight, anyway. Here's a question for Chris Brumme:

How does the CLR's Just-In-Time compiler decide where to put an architectural stop in the generated IA64 instruction stream? The original sequence points in the source code aren't emitted to the IL, as far as I can tell.

IA64's Global Pointer

[Update 2004-01-24 13:14 GMT: Raymond emailed me regarding the reason for saving gp across translation units. Suggestions gratefully incorporated.]

Raymond Chen has been posting even more material about IA64 in the last few days. There's a lot of confusion over the role of the global pointer (gp, or r1) register. Some people have been comparing it to the segment registers as used in 16-bit x86 code.

I've been doing a bit of DOS programming during the last week (the project I referred to in the last post is cross-platform, running on DOS, Palm, desktop Windows in a console, and Windows CE), and I can safely say that I don't particularly need to care about segments; the compiler's memory model sorts that out. You had to care with Windows 3.x, because the segments were a lot more visible.

The problem is relocatable code: how does a program (or, in general, a module - it could equally be a library) find its global data if the address it loads at can change? One solution is to have a table of addresses that need to be adjusted if the load address does change, which is what Windows on x86 does. The downsides are a large table of relocations on disk (SHDOCVW.DLL, the web browser control, has 40KB of them in a 1.2MB binary), a delay in loading as the table is processed, and more memory used, since many pages get edited - and so can no longer be shared - if the binary is loaded at different addresses in different processes.

RISC processors have another problem - they typically use fixed-length instructions: every instruction is the same size, and on Hitachi SHx, ARM Thumb or MIPS16 they are only 16 bits long. A fixed-size instruction can only carry a small amount of immediate data (data specified as part of the instruction): on IA64, the add-immediate-to-register instruction has room for only 22 bits. To load a larger value, you must either build it up over several instructions (taking care over the dependencies between them), or load it from memory. But where do you find it?

The solution used on most RISC processors is a so-called PC-relative pool: the data is stored at a location relative to the current Program Counter (or Instruction Pointer, if you prefer). The code then doesn't need to change if the program is relocated - the data is still located near the function (just after the final return instruction, in ARM's case).
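
As a rough illustration - the assembly in the comment is simplified, assumed ARM-like syntax, not real compiler output:

    /* A function returning a constant too large for an immediate field. */
    unsigned int get_key(void)
    {
        return 0xDEADBEEF;
    }

    /* Conceptually, with a PC-relative literal pool, this becomes:
     *
     *     get_key:
     *         ldr  r0, [pc, #0]   ; load the value from the pool below
     *         bx   lr             ; return
     *         .word 0xDEADBEEF    ; the literal pool entry
     *
     * The load uses only a fixed offset from pc, so the code works
     * unchanged wherever the module is loaded - no relocation needed.
     */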

However, this has pretty poor data locality. Most processors have split instruction/data Translation Lookaside Buffers and separate Level 1 cache memory for instructions and for data. The PC-relative pools tend to be quite small, and scattered throughout the program, which leads to poorer use of the separate caches and TLBs. We'd prefer it if we could keep the data together. Also, thinking of security, we can't protect our code as well as we could do, because we have to give the whole page execute permissions. If the data needs to be writeable, we have to give the whole page write permissions too. If we write to the page, the OS must copy it and can no longer share it between processes.

A possible solution is to only store pointers to the data in the PC-relative pool, but then we've hit the relocations problem again - the pointers must be rewritten to take account of a relocation.

On architectures with a lot of registers, we could decide to use - by convention - one of those registers to point to the base of the module's data area. MIPS has 32 general-purpose registers, and Windows NT used register $28 as a 'global pointer' (Windows CE does not use global pointers). IA64 has 128 general registers (32 of which are fixed; the other 96 act as a register stack), and r1 is reserved as the global pointer. This allows us to have a lot of data with no relocations, and to keep it all in one big block.
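
For example - the commented assembly is my assumed, simplified rendering of the idea, not verified compiler output:

    int counter;   /* placed somewhere in this module's global data area */

    int next_id(void)
    {
        return ++counter;
    }

    /* On IA64 the access is conceptually something like:
     *
     *     addl r14 = @gprel(counter), gp ;;  // &counter = gp + fixed offset
     *     ld4  r15 = [r14]                   // load, increment, store back
     *
     * Only the offset from gp is baked into the code, so no relocation
     * is needed however the module is based.
     */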

There's a downside, though: whenever we call outside the current module, we have to store the current global pointer value, load the correct one for the called function, call the function, and then restore gp when it returns. 'Module' here effectively means the compilation unit: if you call a function defined in another source file, gp may need to be changed, because the compiler cannot tell whether that function is imported from a DLL with its own global data area. An aside: for Visual C++, a non-virtual C++ method call is known to stay within the same library if the class is not marked __declspec(dllimport). You can avoid the save/reload operations entirely using Whole Program Optimization/Link-Time Code Generation, because the compiler then knows the location of every entity.
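
In Visual C++ terms, a sketch (the function names are hypothetical):

    /* Declared dllimport: the compiler knows this call crosses a module
       boundary, so it must save gp, load the callee's gp, call, and
       restore gp afterwards. */
    __declspec(dllimport) void LogMessage(const char *msg);

    /* No dllimport: might be in this binary, might be imported - so,
       without link-time code generation, the compiler has to be
       conservative about gp across the call. */
    void Helper(void);

    void Run(void)
    {
        Helper();             /* possibly a gp save/reload, just in case */
        LogMessage("done");   /* definitely a gp swap */
    }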

This also has a side-effect on function pointers. Since function pointers are generally used far from where they were created, they might be used in a module with a different gp value. The compiler gets around this by not compiling a function pointer to a single pointer-sized value; it compiles it to a pair of pointer-sized values - a 'function descriptor' - one holding the address of the function's first instruction (first bundle, on IA64), the other holding the correct gp value to use.
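
Conceptually, something like this (a sketch of the idea, not the toolchain's exact layout):

    /* What an IA64 'function pointer' amounts to: */
    typedef struct FunctionDescriptor {
        void *entry_point;   /* address of the function's first bundle */
        void *gp_value;      /* the gp for the module that owns it */
    } FunctionDescriptor;

    /* An indirect call loads gp_value into gp and branches to
       entry_point; the caller restores its own gp on return. */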

There's no limit on the size of this global data area. However, only a (relatively) small amount of it can be addressed in a single pair of instructions: the 22-bit immediate of the IA64 add instruction limits the 'small' part of the static data area to 4MB. If any more is required, the compiler stores the address of the larger block in the small area, then compiles a double indirection. It also adds this address to the relocations table, since it will need to be updated if the module is relocated. Again, the compiler has to be quite conservative, so anything not declared static is likely to be put in the large area. Using static to indicate that a variable is only used in the current translation unit is good practice anyway (information hiding), but it matters for performance too. IA64 programs benefit quite a lot from link-time code generation, since the compiler can then put only the genuinely large items in the large data area.
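
For instance (a sketch; the actual placement is the compiler's decision, as described above):

    /* Internal linkage: the compiler sees every use of this variable,
       making it a good candidate for the gp-relative small data area. */
    static int s_lookup[64];

    /* External linkage: the compiler must be conservative, so this may
       end up in the large area, reached by first loading its address
       from the small area - the double indirection. */
    int g_big_table[2 * 1024 * 1024];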

An alternative strategy would have been to perform multiple adds, or shift-and-add, to gp to get a larger data area, but it doesn't get you much, and wastes a lot of instructions. The double-indirection is probably the right balance.

To be honest, few programs really have that much static data. More often, memory is allocated dynamically, with only the root pointer(s) of the data structure stored in a global variable, or on the stack. SHDOCVW.DLL only has 6.7KB of static data (it has over 540KB of resources). Word 2000's WINWORD.EXE (an 8MB executable) has 789KB of static data. If I had to have that much data, I'd store it as a binary resource or as a file and load it at runtime (which again would likely have better locality of reference).
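
Something along these lines would do it - a minimal Win32 sketch, with the resource name purely illustrative:

    #include <windows.h>

    /* Fetch a binary blob stored as an RCDATA resource, instead of
       compiling it in as static data. */
    const void *LoadBlob(HMODULE module, DWORD *size)
    {
        HRSRC res;
        HGLOBAL handle;

        res = FindResource(module, TEXT("MYBLOB"), RT_RCDATA);
        if (res == NULL)
            return NULL;
        handle = LoadResource(module, res);
        if (handle == NULL)
            return NULL;
        *size = SizeofResource(module, res);
        return LockResource(handle);   /* no unlock required on Win32 */
    }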

It's important to note, therefore, that the use of the global pointer is an optimisation. It improves access to global data by removing the need for a large relocation table, improves locality of data references, and reduces loading overhead.

This is dramatically different from the DOS/Windows 3.1 case. Under DOS, the compiler usually takes care of the memory model, but you had to tell it which one to use. In 16-bit code you only had two data segment registers to play with, and it was quite costly to change segments: you had to load the right segment into DS or ES (having saved the old value on the stack) and then, if using ES, add the segment override prefix, which cost extra cycles. You wanted to avoid the larger memory models if you could. Windows added another twist: you often needed to know whether your data was near or far so you could make the right function call. Each segment was only 64KB in size, so you needed a lot of them for a complete program (the one I refer to above is a thin client, yet requires two code segments and two data segments for the whole program).
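
In 16-bit C, that distinction was spelled out in the types themselves (compiler extensions, sketched from memory):

    char near *np;   /* 16-bit offset within the default data segment */
    char far  *fp;   /* full segment:offset pair - 32 bits to store,
                        and slower to dereference (segment load) */

    /* Passing a near pointer where a far one was expected was a classic
       bug; the memory model (small, medium, compact, large) set the
       defaults for you. */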

It's worth making the point that every register on IA64 can be used to point to any part of the whole 64-bit address space: the address space is flat. The gp register is just a convention and your compiler could ignore it, at least for internal calls. I believe the benefits outweigh the slight disadvantages.

Friday 23 January 2004

More IA64

This afternoon, I was feeling bored, so I decided to get some practical IA64 experience in: I exported a makefile for the project I was working on, changed a couple of things (notably switching /MACHINE from I386 to IA64) and turned on /FAs. This lets me read the generated assembly for code I'm already familiar with. I don't have an IA64, so I have to be my own processor while reading it. My brain is getting into quite a tangle from the compiler hoisting quite a few operations up the instruction stream from where they were requested in the source (which it's permitted to do in C++, so long as the external view is that the operations happened in the original order). It also does quite a bit of speculation, computing results that may never be needed.

Of course, this is different from modern x86 processors, which take the x86 instruction stream, convert it to smaller RISC-like operations, schedule instructions out of order with speculative execution, then work out how to recover the x86 state from that. However, a large proportion of the processor is simply made up of the translators, out-of-order schedulers, and instruction retire logic - larger than the portion that actually performs the computations. IA64 offloads all this work onto the compiler - the core itself is a simple in-order execution engine. The instruction stream explicitly tells the processor which instructions can execute in parallel, and which depend upon each other.
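
For example (my reading of the format; the assembly in the comment is assumed, simplified syntax):

    int sum4(int a, int b, int c, int d)
    {
        return (a + b) + (c + d);
    }

    /* The two inner adds are independent, so they can share an
     * instruction group; the final add depends on both, so a stop
     * (written ';;') must separate them:
     *
     *     add r8 = r32, r33       // a + b  } same group - may issue
     *     add r9 = r34, r35 ;;    // c + d  } in parallel
     *     add r10 = r8, r9        // needs both: after the stop
     */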

Comparing CLR, Mono, SSCLI and JAVA Performance

Eric Gunnerson links to a very old post by Werner Vogels titled Comparing CLR, Mono, SSCLI and JAVA Performance.

I'm sure I've seen that before. Maybe a different blogger (probably the prolific Robert Scoble) linked to it.

Anyway, it's encouraging that .NET is doing so well relative to Java considering that Java has had about six years more development. SSCLI is of course intended as a research implementation that's easier to understand, rather than a high-powered full implementation (and I suppose MS wouldn't want to give away too many secrets, either).

The exception-handling performance is a little concerning. Does the CLR really have to raise a Win32 structured exception to handle a managed exception, carrying with it a massive amount of overhead? IIRC, the Win32 function RaiseException is largely implemented in kernel mode, and requires a user-to-kernel-to-user mode transition to call each filter function for each of two passes. Ever since reading Chris Brumme's entry on the exception model, I've wondered whether it couldn't defer this operation until it actually hits some unmanaged code.

On XP and Server 2003, it could perhaps insert a vectored exception handler instead.
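
Something like this, perhaps - a sketch of the XP-era API, not a claim about how the CLR actually works:

    #include <windows.h>

    /* Gets a first-chance look at every exception in the process,
       without walking the frame-based handler chain. */
    static LONG CALLBACK FirstChanceHandler(PEXCEPTION_POINTERS info)
    {
        /* ...decide whether this is a managed exception and, if so,
           dispatch it to the managed handlers... */
        return EXCEPTION_CONTINUE_SEARCH;   /* otherwise keep searching */
    }

    void InstallHandler(void)
    {
        /* 1 = call this handler ahead of other vectored handlers */
        AddVectoredExceptionHandler(1, FirstChanceHandler);
    }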

Wednesday 21 January 2004

Operating System Sucks-Rules-O-Meter

In case this has any effect, 'Linux Sucks'.

We now return you to your regularly scheduled programming...

Monday 19 January 2004

Funniest error page ever

Gotta love CP's error page:

Sunday 18 January 2004

Real need a good kicking

I'm not sure who to be more pissed at, Real or Microsoft.

I was trying to play an ordinary MP3 from a website a short while ago. I clicked the link, bloody Real Player pops up. Dammit, I thought, I never asked for Real Player, I detest the thing, but another user on this computer installed it.

OK, never mind, change the association - hang on, where's the option in Media Player? There's supposed to be a File Types tab - and it's missing!

OK, let's look at 'Set Program Access and Defaults' - nope, can't do that, I'm not an admin in this account.

(Swears loudly)

Log on as my admin account, make my normal account an administrator, log off admin account, log on in normal account. Oh look, File Types is back. Use SPAD anyway to fix anything else Real has broken. Now links play in WMP, like they ought to.

Note to self: become a Limited User again later

So let's summarize, shall we:

  • Media Player should show its File Types tab to all users, no exceptions.
  • Set Program Access and Defaults should be accessible to all users - maybe it shouldn't show the 'show this program' checkbox, but I should be able to change my damned default player!
  • Real should not assume that all users want their crap if one user installs it.

To be honest, it's not entirely Real's fault: HKEY_CLASSES_ROOT in Windows is a sort of virtual registry key, where per-user settings override per-machine ones. Somehow you have to propagate default settings, including file type associations, for newly installed programs to individual users. I'd prefer it if Windows checked during logon and made a copy of any newly installed settings, but without overwriting anything I was already using. Question: how do you determine what was in use?
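
For the curious, the merged view can be seen directly - a sketch, using .mp3 as the example:

    #include <windows.h>
    #include <stdio.h>

    /* HKEY_CLASSES_ROOT merges HKEY_CURRENT_USER\Software\Classes
       (per-user, wins if present) over HKEY_LOCAL_MACHINE\Software\Classes
       (per-machine fallback). Reading the default value of .mp3 shows
       which ProgID - and hence which player - currently owns it. */
    int main(void)
    {
        HKEY key;
        char progid[256];
        DWORD size = sizeof(progid);

        if (RegOpenKeyExA(HKEY_CLASSES_ROOT, ".mp3", 0, KEY_READ, &key)
                == ERROR_SUCCESS)
        {
            if (RegQueryValueExA(key, NULL, NULL, NULL,
                                 (LPBYTE)progid, &size) == ERROR_SUCCESS)
                printf("Associated ProgID: %s\n", progid);
            RegCloseKey(key);
        }
        return 0;
    }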

Wednesday 14 January 2004

On AMD64 exception handling

Raymond posted an article on AMD64 calling conventions.

In the comments, I wrote:

I assume that using a single subtraction to adjust the stack for the whole duration of the function - including function call parameters - simplifies the exception unwind procedure.

Context: SEH exceptions on AMD64 (for 64-bit programs) are table-based, NOT based on an exception handler chain at fs:[0] as on x86. Raymond, any idea why x86 is the only architecture which uses this frame-based exception handler chain?

I also note that table-based exception handlers can't be exploited by overwriting the handler on the stack, because they're not on the stack.
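
The x86 chain looks roughly like this (sketched from memory):

    /* Each record lives on the stack; fs:[0] points at the head.
       A stack-smashing attack that overwrites 'handler' gets its code
       called by the OS as soon as an exception is raised. */
    typedef struct _EXCEPTION_REGISTRATION {
        struct _EXCEPTION_REGISTRATION *prev;  /* next record in chain */
        void *handler;                         /* routine to call */
    } EXCEPTION_REGISTRATION;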

Coming from a Windows CE background of table-based handlers, it seems odd that the unwind table contains a description of the effect of each operation performed in the function prologue. I suppose this allows the unwind code in the OS to be a little more generic (applicable to many architectures). Windows CE just interprets the instruction stream of the prologue, executing it backwards - that is, performing the reverse of each instruction, in reverse order: if the prologue pushes register A and then subtracts 20 from register B, the unwind code adds 20 to B and then pops A.

The desktop approach does allow the unwind code to be interpreted forwards, but adds to the size of the executable (and probably the working set when unwinding the call). However, the CE approach may cause parts of the executable to be paged in solely so that the stack can be unwound. Horses for courses, I suppose - it's more important to preserve memory on CE, while fast code is more important on the desktop.
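
For reference, the AMD64 per-function unwind data looks roughly like this (sketched from the documentation, simplified):

    /* One unwind code per prologue operation, interpreted forwards. */
    typedef union _UNWIND_CODE {
        struct {
            unsigned char CodeOffset;    /* where in the prologue */
            unsigned char UnwindOp : 4;  /* what it did: push, alloc, save */
            unsigned char OpInfo   : 4;  /* e.g. which register */
        } s;
        unsigned short FrameOffset;
    } UNWIND_CODE;

    typedef struct _UNWIND_INFO {
        unsigned char Version : 3;
        unsigned char Flags   : 5;       /* handler/chained-info flags */
        unsigned char SizeOfProlog;
        unsigned char CountOfCodes;
        unsigned char FrameRegister : 4;
        unsigned char FrameOffset   : 4;
        UNWIND_CODE UnwindCode[1];       /* variable-length array follows */
        /* then an optional handler address or chained unwind info */
    } UNWIND_INFO;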

[More information on AMD64 calling conventions]

Friday 9 January 2004

Habits of ineffective people

It's been a long week.

It shouldn't have been - I've had plenty of work to do - but somehow I've been virtually unable to do any of it. OK, I've just gone back to work after three weeks off, but that hasn't been a problem before.

Monday and Tuesday were spent reinstalling after last Friday's debacle. Once you've got a system that mostly works, though, the horrible temptations are all back - reading email, news sites, forums, downloading software, etc. It can be difficult to remember that there's an install happening in the background - I think VS.NET 2003 was waiting for disc 2 for about an hour...

Meanwhile I seemed to get more done on my own PC at home than at work. Maybe this is a sign that my work environment just isn't effective at the moment - could be time for a change.

Then, this afternoon, I found Joel's article on Fire And Motion. Nice to know I'm not alone, but I can't let it go on.

Saturday 3 January 2004

Sigh...

Looks like I need to write my own posting plug-in for NewsGator: the one that you can download from NewsGator.com ignores the <title> element.

Don't store data you need later in a place you're about to destroy

I made - or rather, encountered - a bit of a howler at work yesterday.

I was trying to resize the clusters on my hard disk with PartitionMagic 8.0. The system is not exactly the fastest in the world (an 850MHz Duron, with only 64KB of level 2 cache, and a hard disk I can only describe as glacial) - if you try right-clicking the task bar when you've not done so for a while, it can take over ten seconds to bring up the pop-up menu. Windows 2000 installs with a default cluster size of 512 bytes on an NTFS partition (for a reason I'll describe shortly), which means eight separate reads to fill a single memory page (4KB on x86 systems). I'm not sure whether the disk driver subsystem can actually coalesce those into a multiblock transfer or not.

Anyway, PM indicated it would need to reboot to resize the clusters on the system partition. Where did it place the temporary file of commands? In C:\WINNT\TEMP. That's right, on the partition to be converted, along with the program itself.

So the conversion proceeded perfectly, right up to 100% - at which point, it generated a number of errors (which I should have written down), and said 'Press any key to reboot'. OK, reboot - nothing.

Great.

So I boot up with the PM Rescue disks I've created and check the partition for errors: it brings up a number of errors, none of which it can fix (starting with 'MFT Mirror in wrong place' or some such). OK, let's try CHKDSK: put the Windows 2000 CD in the drive, reset the BIOS settings to boot from CD, go to the Recovery Console. Run CHKDSK /P, because the disk is marked clean and a plain CHKDSK would otherwise skip it.

To cut a long story short [too late - Ed], CHKDSK took all afternoon and got stuck at 50%. After leaving it for about an hour, I rebooted and tried PM Rescue again. More errors, this time fatal. I examined the drive information and discovered that the cluster size was still set to 512 bytes in the filesystem metadata. I assume that PM had managed to do all of the conversion apart from the critical step of updating the cluster size. Thanks a bunch, guys. This probably means that CHKDSK in fact corrupted the whole disk rather than fixing the problem. I figure that PM failed because it either couldn't find the batch script or couldn't find the programs any more (duh! you just rewrote all the metadata for the disk!).

Of course, I didn't take a backup, but I'd concluded that a) there wasn't anywhere I could write 20GB of stuff and b) most of it was programs - three editions of Visual Studio take up a whole load of space. Everything that was actually work-related was already backed up to the network.

So I guess I'll see if I can persuade my boss to let me have Windows XP now.

Oh yes: why does Windows 2000 install with a 512-byte cluster size? If you create a new partition during setup and select 'Format using NTFS', it doesn't actually format it as NTFS. It formats the disk as FAT, installs onto that, then converts the disk and applies security settings. The FAT-to-NTFS converter ignores the cluster size configured for the FAT partition; it always uses the disk's sector size, normally 512 bytes. All versions of NT before Windows XP did this, which is why you couldn't create a system partition greater than 2GB in Windows NT 4.0 setup - FAT16 didn't support it.
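
For the record, the relevant commands (syntax from memory):

    format D: /FS:NTFS /A:4096    - format with an explicit 4KB cluster size
    convert C: /FS:NTFS           - the converter has no cluster-size option;
                                    you get the sector size, normally 512 bytes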

Windows XP and Server 2003 do it properly - they actually do format as NTFS to start with. I assume that the reason for the FAT installation was that the NTFS version of format couldn't be fitted onto the NT 4.0 boot disks, and it didn't get changed for Windows 2000 for whatever reason.