Sunday, 29 February 2004

Future of HTML

Inspired by Dare: A Look at the xml:base attribute and the .NET Framework's XmlReader

Let's start with a bald statement: I believe that the web will continue to use the current lingua franca of page description, HTML 4.01. This is the zenith of the HTML series of standards: it describes the use of styles to provide separation between document model and appearance, and standardises the use of plug-in objects.

I see XHTML 1.0 and later as being a solution in search of a problem. XHTML 1.0 is a reformulation of HTML using the stricter XML model, which should allow a standard XML parser to parse XHTML successfully.

Unfortunately, XHTML is strictly incompatible with HTML 4.0. The problem stems from the requirement that all elements in XML are closed, whereas HTML does not require this. XML offers a simpler syntax for elements with no content, e.g. <br />. If HTML is interpreted strictly, that / is illegal.
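To make the closing-tag point concrete, here is a minimal sketch using Python's standard library XML parser (Python is chosen purely for illustration, and the helper name is_well_formed is my own). It only shows what a strict XML parser accepts; it says nothing about how any browser treats either form.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if a standard XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

xhtml_style = "<p>one<br />two</p>"  # empty element closed, XML style
html_style = "<p>one<br>two</p>"     # fine as HTML, but <br> is never closed

print(is_well_formed(xhtml_style))
print(is_well_formed(html_style))
```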

The intent of XHTML is to split HTML down into modules, which can be implemented as required by a browser. The unfortunate part is of course that large swathes of the existing Web already contain elements missing from, for example, the XHTML Basic profile. To remain usable for a given user, the user's browser must implement all of HTML 4.0 - making XHTML basically pointless.

Of course, IE has problems with XHTML anyway. The Jargon File renders strangely on IE due to mismatched character-set information. The server doesn't supply any character set information: the HTTP headers only indicate Content-Type: text/html. The file I linked to is formulated as XHTML (and shouldn't be transmitted as text/html anyway); the <?xml?> declaration indicates encoding="UTF-8". IE uses its default character set, Windows-1252, to display the data, leading to the wrong result. It does this because the HTTP header didn't indicate a character set. IE also goes into Quirks mode, because there isn't a valid HTML 4.0 DTD.
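A rough sketch of that fallback, in Python: take the charset from the Content-Type header if there is one, otherwise fall back to Windows-1252. The function name pick_charset and the sample text are my own, and the real logic in IE is certainly more involved than this; the point is only what happens to UTF-8 bytes when the fallback kicks in.

```python
from email.message import Message

def pick_charset(content_type_header, fallback="cp1252"):
    """Use the header's charset parameter if present, else the fallback."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_charset() or fallback

body = "na\u00efve".encode("utf-8")  # the bytes the server actually sends

# No charset in the header: the UTF-8 bytes are read as Windows-1252.
wrong = body.decode(pick_charset("text/html"))
# Charset declared: the same bytes decode correctly.
right = body.decode(pick_charset("text/html; charset=UTF-8"))

print(wrong)
print(right)
```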

Friday, 27 February 2004

Alpha vs Beta test

From Paul Thurrott's WinInfo Short Takes:

"First, Longhorn, the next major version of Windows, is currently in pre-beta (which used to be called alpha, but Microsoft actually refers to as "pre-alpha," which doesn't make sense) [...]"

Note: this article is an expanded version of my comment on WinInfo.

Pre-alpha versus alpha versus beta versus pre-beta: terms often misused. Alpha testing is when you are feature-complete (you've implemented all the features you intend to ship) and are testing internally. Beta releases come once you've performed a lot of internal testing and resolved as many issues as you think customers will encounter. Asking customers to perform their daily work on the software can reveal problems that weren't found in the fairly sterile environment of the alpha test lab. It also allows you to get feedback on how easy or difficult the software actually is to use.

I've not really heard of pre-beta, but I suppose it could refer to an intermediate stage between alpha and beta test (for example, polishing installers - your in-house testers can probably cope with cruftier installers than the users could).

MS seem to be changing a lot more between beta releases these days: some of the code is still alpha quality, or features are appearing in later betas that weren't present in earlier ones.

Longhorn is still a long way from design complete, let alone feature complete. It can only be said to be 'pre-alpha'. Build 4051 was just a dump to allow developers to get an early look at current thinking - call it a warning shot, if you like.

See also Joel: Picking a Ship Date.

Reference-based languages

Call me an unreconstructed C zealot ;)

Thanks.

You still have to understand pointers in a reference-based language such as VB(.NET), C# or Java. You just don't see them, and have to remember whether a name means a value or a reference.
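A small illustration of the value-or-reference question, in Python rather than VB or C# (so treat it as an analogy, not a description of those languages). Python has no ByRef: every name holds a reference, and the function names here are made up for the example.

```python
def append_item(items):
    # Mutates the object that the caller's name also refers to.
    items.append(4)

def rebind(items):
    # Only rebinds the local name; the caller's list is untouched.
    items = [0]
    return items

values = [1, 2, 3]
alias = values            # a second name for the same list, not a copy
append_item(values)
rebind(values)
print(values, alias is values)

count = 1
other = count
other += 1                # ints are immutable, so this rebinds 'other' only
print(count, other)
```

If you didn't know that a list name is really a pointer in disguise, the first result would be a surprise.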

You also have to know the difference between ByVal and ByRef in VB, otherwise it bites you when a change to a parameter made inside a method gets reflected up to the caller. Remember that the default for a parameter in VB6 is ByRef, which can cost you more for the simple types.

VB6 also has the braindead convention of being able to put parentheses around a parameter to indicate that it's passed by value where it would normally be a reference (actually, I think it's a side-effect - it gets evaluated as an expression, and a reference to a temporary is passed that's thrown away on return).

I'm not that sure that the mental overhead is worth it.

(Inspired by this Ask Joel thread).

Thursday, 26 February 2004

Careful when using udp/1434

I had an interesting problem this afternoon: a client program using a UDP transport was having trouble communicating with its server. The client doesn't call bind() unless you specify a particular port, so it gets a dynamic port (which Windows allocates from 1025). It turned out it was using UDP port 1434...

...which is, by convention, SQL Server's port. It appears that Enterprise Manager sends (as I discovered a couple of days ago) packets to this port to discover if the server is alive, for all registered servers. The client was running on our test box, for which we use the same IP address regardless of what's installed. I'm guessing (but didn't confirm) that a colleague still has a registration for that server.

So the client was getting a single byte just often enough to assume that the server had responded (for reasons I can't go into here, we don't connect() to the server, which would have limited us to accepting packets from one particular end-point). It was processing the response as 'response received,' but then discarding it because it was too short. Result: going round and round in a retry loop.

The key point: always verify as much of your protocol as possible. Put some magic numbers in it, if you're designing a binary protocol. Reject anything that looks even vaguely wrong.
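As a sketch of what I mean, here is a made-up binary header checked strictly before anything is treated as a response. The magic value, version number and field layout are all hypothetical (this is not our actual protocol), and Python is used just to keep it short.

```python
import struct

MAGIC = 0x54504331               # hypothetical magic number
VERSION = 1                      # hypothetical protocol version
HEADER = struct.Struct("!IBH")   # magic, version, payload length (network order)

def build_datagram(payload):
    """Prefix the payload with the header."""
    return HEADER.pack(MAGIC, VERSION, len(payload)) + payload

def parse_datagram(data):
    """Return the payload, or None if anything looks even vaguely wrong."""
    if len(data) < HEADER.size:
        return None              # too short to be ours, e.g. a stray single byte
    magic, version, length = HEADER.unpack_from(data)
    if magic != MAGIC or version != VERSION:
        return None              # somebody else's traffic
    payload = data[HEADER.size:]
    if len(payload) != length:
        return None              # truncated or padded
    return payload

print(parse_datagram(b"\x02"))
print(parse_datagram(build_datagram(b"hello")))
```

The important design choice is that a rejected packet counts as "nothing received" rather than "response received", so a stray byte can't restart the retry logic.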

Some people write code that is too loose, on the misguided assumption that it makes the software more flexible. It does, but it also means that your program may react badly to being sent something it's not expecting. At worst this can lead to security holes.

Wednesday, 25 February 2004

More character set stuff

I see that MSDN is still having character set trouble: it looks like pages are being encoded with UTF-8, then the encoded page is again passed through UTF-8.

The latest example is in the XP SP2 Windows Firewall information (see 'allow local' near the end of the document). The UTF-8 sequence e2 80 9c (which shows in the document as â€œ) is U+201C, the opening double-quote character -> “
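The double-encoding theory is easy to reproduce. This sketch assumes the intermediate step treats the UTF-8 bytes as Windows-1252 (my guess at what is happening, not something I've confirmed about MSDN's pipeline), then encodes as UTF-8 a second time.

```python
quote = "\u201c"                        # opening double-quote

utf8_once = quote.encode("utf-8")       # e2 80 9c
misread = utf8_once.decode("cp1252")    # three characters instead of one
utf8_twice = misread.encode("utf-8")    # what ends up on the wire

print(utf8_once.hex(" "))
print(misread)
print(utf8_twice.hex(" "))

# Undoing it: reverse the two steps in order.
repaired = utf8_twice.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired == quote)
```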

The safest way in XML and HTML is to use &# notation (e.g. &#x20AC; is the Euro symbol, €). HTML 3.2 indicated that these were to be interpreted as ISO Latin-1, whereas HTML 4.0, XML 1.0 and later interpret them as Unicode. The named character references (e.g. &ldquo; for “) only work properly in HTML, not in XML documents. XML processors are required to recognise &lt;, &gt;, &amp;, &apos; and &quot; - < > & ' ", respectively (see section 4.6 of the XML 1.0 specification [link to Tim Bray's annotated version; definitive version at www.w3.org]).
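Here's a quick check of those claims using Python's standard library (html.unescape for the HTML side, ElementTree for the XML side). It's a sketch of how one particular parser behaves; other XML processors should agree on a document with no DTD, but I've only tried this one.

```python
import html
import xml.etree.ElementTree as ET

# HTML: both numeric and named references resolve.
print(html.unescape("&#x20AC;"), html.unescape("&ldquo;"))

# XML: numeric references are fine...
euro = ET.fromstring("<p>&#x20AC;</p>").text

# ...but &ldquo; is not defined unless a DTD declares it.
try:
    ET.fromstring("<p>&ldquo;</p>")
    named_ok = True
except ET.ParseError:
    named_ok = False

# The five predefined entities always work.
predefined = ET.fromstring("<p>&lt;&gt;&amp;&apos;&quot;</p>").text

print(euro, named_ok, predefined)
```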

Tuesday, 24 February 2004

'I have an antivirus program that I'll monitor myself'

Dear MS:

With regard to the XP SP2 Security Center, specifically the dialog depicted here. Please could you add a checkbox labelled:

I am not an idiot, and do not require an anti-virus program

Thanks.

Seriously, I believe that anyone practising safe computing does not need an anti-virus program. The rules basically boil down to:

  • Don't run file attachments to emails.
  • Don't open emails from people you don't recognise.
  • Keep up to date with security patches.
  • Be wary of what you download.

Follow these and you'll be fine.

I was a little concerned earlier today when our router logs indicated that my computer was sending arbitrary data over UDP to an external port 1434, which is SQL Server's well-known service discovery port. It turned out to be Enterprise Manager trying to discover whether a colleague's server was running; he doesn't register with our DNS server, whereas I use that as the primary DNS (we have a partly implemented domain) and instead of finding his computer name, it found computername.co.uk. Thanks, TDImon!

Bet they were pleased to get udp/1434 traffic...

On making tea

Follow-up to the methodology of making tea (jeffdav) - milk in first, or afterwards?

If you're doing it properly, you should be making the tea in a pot first, pouring the milk into the cup or mug, then pouring the tea from the pot into the cup once it's brewed.

I'm quite lazy at work and pour hot water directly onto a teabag in the mug, wait about half a minute for it to brew, remove the teabag then add the milk. Milk + teabag doesn't go together well, IMO.

You can buy two sorts of teabags in the UK: regular and 'one cup'. The difference is basically that 'regular' is stronger, and tends to be a blend of teas. Since 'one cup' teabags are nearly as, if not more, expensive compared to regular ones, a lot of people simply use regular bags for making single cups and whip the bag out sooner.

Character sets

While on the subject of character encoding, I had a big followup which I was going to send to John Robbins about his Bugslayer column in this month's MSDN Magazine. However, I drifted off the point a bit, and it would make a better blog entry.

My basic point relevant to the article is that there's no such thing as 'Greek ASCII' (apart from "it's all Greek to me," of course). ASCII was standardised as ISO 646-US. It's a 7-bit only character set; only code points 0 - 127 (decimal) are defined. The meaning of any other bits (ASCII was defined before standardisation - here de facto, not de jure - on an 8-bit byte) is up to the implementation. There are seven other variations, the simplest being 646-UK, which only swaps £ in for # at code-point 35.

The Danish/Norwegian, German and Swedish forms cause havoc for C programmers, because characters essential for writing C programs (e.g. {}, [], \, |) are replaced (relative to the -US set) with accented vowel characters. C partly gets around this using <iso646.h>, which defines a number of alternate names (macros) for some of the operators that are really messed up by this. C also has trigraph support (officially, although few compilers support it), where these characters can be produced by using ?? and another character (e.g. ??/ => \). C++ also has some digraphs which are easier to type and remember than the trigraphs, but are more limited. In C++, the iso646.h names are officially keywords in their own right.

The irony of this is that for the most part, very few people now need the trigraphs, digraphs or alternate keywords, because almost everyone is now at least 8-bit capable. The de jure standards for 8-bit character sets - at least, European ones - are the ISO 8859 series, including the well-known ISO 8859-1 suitable for Western European languages. The de facto standard is of course Windows-1252, which defines characters in a region between code points 128 and 159 which 8859 marks as unused (and IANA's iso-8859-1 reserves for control characters). 8859 uses 646-US for the first 128 code points. This often causes havoc on the Web, where many documents are marked as iso-8859-1 but contain windows-1252 characters (although this is usually the least of the problems).
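The 128 to 159 difference is easy to see by decoding the same bytes both ways. A small Python sketch; the sample bytes are my own, standing in for a page labelled iso-8859-1 that actually contains Windows-1252 curly quotes.

```python
import unicodedata

raw = bytes([0x93, 0x48, 0x69, 0x94])   # "Hi" wrapped in Windows-1252 curly quotes

as_1252 = raw.decode("cp1252")          # what the author meant
as_8859 = raw.decode("iso-8859-1")      # what the label says

print(as_1252)
print([unicodedata.category(ch) for ch in as_8859])  # 'Cc' marks a control character

# From 0xA0 upwards the two character sets agree.
upper = bytes(range(0xA0, 0x100))
print(upper.decode("cp1252") == upper.decode("iso-8859-1"))
```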

8859 is a single-byte character set: a single 8-bit byte defines each character. This doesn't give nearly enough range for Far East character sets, which use double- or multi-byte character sets. An MBCS character set (such as Shift-JIS) reserves some positions as lead bytes, which don't directly encode a character; instead, they act as a shift for the following or trail bytes. Unfortunately, the trail bytes aren't a distinct set from the lead and single bytes. This gives rise to the classic Microsoft interview question: if you're at an arbitrary position in a string, how do you move backwards one character?
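One possible answer to the interview question, sketched in Python for Shift-JIS: you can't decide by looking at the previous byte alone, so scan forward from a known boundary (here, the start of the string). The lead-byte ranges are the usual Shift-JIS ones; the function names are mine, and this is a sketch rather than a production routine.

```python
def is_lead_byte(b):
    """Shift-JIS lead bytes (ranges as used by Windows code page 932)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def previous_char_start(data, pos):
    """Return the offset of the character that ends at offset pos.

    Assumes pos > 0 is a character boundary in well-formed data.
    """
    i = 0
    last = 0
    while i < pos:
        last = i
        i += 2 if is_lead_byte(data[i]) else 1
    return last

data = "a\u30bdb".encode("shift_jis")   # 'a', katakana SO, 'b'
print(data.hex(" "))                    # the trail byte 5c is also ASCII backslash
print(previous_char_start(data, 3))     # the character before 'b' starts at offset 1
```

The trail byte 0x5C is exactly why naive byte-at-a-time code goes wrong: looking only at offset 2, it sees a backslash.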

For some reason best known to themselves, all byte-oriented character encodings are known in Windows as 'ANSI', except those designed by IBM for the PC, which are known as 'OEM'. If you see 'ANSI code-page', think byte-oriented.

Frankly this is a bit of a nightmare, and a rationalisation was called for. Enter Unicode (or ISO 10646). Now, a lot of programmers seem to believe that Unicode is an answer to all of this, and only ever maps to 16-bit quantities. Unfortunately, once you get outside the Basic Multilingual Plane, you can get Unicode code points that are above U+FFFF. It's better to think of Unicode code points as being a bit abstract; you use an encoding to actually represent Unicode character data in memory. The encoding that Windows calls 'Unicode' is UTF-16, little-endian. This serves to confuse programmers. Actually, versions of Windows before XP used UCS-2, i.e. they didn't understand UTF-16 surrogates, which are used to encode code points above U+FFFF. Again, for backwards compatibility (at least at a programmer level), the first 256 code points of Unicode are identical to ISO 8859-1 (including the C1 controls defined by IANA).
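For the curious, this is the arithmetic behind a surrogate pair, as a short Python sketch. The function name is mine; the example code point, U+1D11E (musical G clef), is just one that happens to sit outside the Basic Multilingual Plane.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")

# The same thing via the codec, in the little-endian byte order Windows uses.
encoded = "\U0001D11E".encode("utf-16-le")
print(encoded.hex(" "))
```

Code that assumes one 16-bit unit per character (the UCS-2 view) will treat this as two characters.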

You may have heard of UTF-8. This is a byte-oriented encoding of Unicode. Characters below U+0080 are specified with a single byte; otherwise, a combination of lead and trail bytes is used. This means that good old ASCII can be used directly.

Hang on, that sounds familiar... The difference with UTF-8 is that the characters form distinct subsets; you can tell whether a given byte represents a single code point, a lead byte, a trail byte, and if it's a lead byte, how many trail bytes follow. UTF-16 has the same property; the term surrogate pair is used because there can only be two code words for a code point. UTF-16 can't encode anything after U+10FFFF because of this limitation. This makes it possible to walk backwards, although everyone who has --pwc in their loops has a potential problem.
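Here's how that walk backwards can look for UTF-8, as a Python sketch over a bytes object. The function names are mine, and the classifier assumes well-formed input (it doesn't reject the byte values that can never appear in valid UTF-8).

```python
def utf8_byte_kind(b):
    """Classify one byte of well-formed UTF-8: (kind, number of trail bytes)."""
    if b < 0x80:
        return ("single", 0)
    if b < 0xC0:
        return ("trail", 0)
    if b < 0xE0:
        return ("lead", 1)
    if b < 0xF0:
        return ("lead", 2)
    return ("lead", 3)

def step_back(data, pos):
    """Return the offset of the code point that ends just before pos (pos > 0)."""
    pos -= 1
    while pos > 0 and utf8_byte_kind(data[pos])[0] == "trail":
        pos -= 1
    return pos

data = "a\u6c34\u20ac".encode("utf-8")   # 1-byte, 3-byte and 3-byte characters
print(data.hex(" "))
print(step_back(data, len(data)), step_back(data, 4), step_back(data, 1))
```

Unlike the MBCS case, there's no need to rescan from the start: skipping trail bytes is enough.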

UTF-8 is more practical than UTF-16 for Western scripts, but any advantage it has is quickly wiped out for non-Western scripts. The Chinese symbol for water (水) at code-point U+6C34 becomes the sequence e6 b0 b4 in UTF-8 - 3 bytes compared to UTF-16's 2. Its main advantage is that byte-oriented character manipulation code can be used with no changes. Recent *nix APIs largely use UTF-8; Windows NT-based systems use UTF-16LE.
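The size trade-off in numbers, as a tiny Python sketch (the helper name encoded_sizes is mine; the two sample strings are just the English word and the character mentioned above):

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for the text, no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for sample in ("water", "\u6c34"):
    print(sample, encoded_sizes(sample))

print("\u6c34".encode("utf-8").hex(" "))
```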

The .NET Framework also uses UTF-16 as the internal character encoding of the System.String type, which is exposed by the Chars property. System.Text.ASCIIEncoding does exactly what it says on the tin: converts to ISO 646-US. Anything outside the 7-bit ASCII range is converted to the default character, ?. The unmanaged WideCharToMultiByte API (thank god it's not called UnicodeToAnsi) allows you to specify the default character, but as far as I can see Encoding does not. GetEncoding( 1253 ) will get you a wrapper for Windows-1253, not 'Greek ASCII'.
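I don't have a .NET snippet to hand, so here is the same behaviour sketched with Python's codecs, which are only an analogue of System.Text.Encoding, not the .NET API itself. The error-handler name underscore_default is hypothetical, registered purely to show a configurable default character.

```python
import codecs

def underscore_default(error):
    """Replace each unencodable character with '_' (illustrative handler)."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    return ("_" * (error.end - error.start), error.end)

codecs.register_error("underscore_default", underscore_default)

word = "caf\u00e9"
print(word.encode("ascii", "replace"))             # built-in default character, ?
print(word.encode("ascii", "underscore_default"))  # our own default character

alpha = "\u03b1"                                   # Greek small letter alpha
print(alpha.encode("cp1253"))                      # Windows-1253 has it at 0xE1
print(alpha.encode("ascii", "replace"))            # ASCII simply doesn't
```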

Monday, 23 February 2004

BlogJet got cooler

BlogJet now has a Code tab, so I can go and tinker with the few bits it gets wrong and add strikethrough.

Recently there was a débacle over débacle, where it had inserted - IIRC - UTF-8 codes which weren't. Didn't render properly in the browser. Let's see if that's fixed...

Nope. The blog is marked as UTF-8, but the é characters appear literally in the code using their Windows-1252 forms. Lessee now, what magic incantation is it... é is in the Unicode Latin-1 Supplement section, so its code is U+00E9, or &#233; to HTML types.
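For anyone who wants to automate the incantation, a Python sketch: look up the code point, show why the raw Windows-1252 byte is wrong in a UTF-8 page, and produce the numeric reference. This is my own illustration, not anything BlogJet does.

```python
e_acute = "\u00e9"
print(f"U+{ord(e_acute):04X}", f"&#{ord(e_acute)};")

# One byte in Windows-1252, two in UTF-8.
print(e_acute.encode("cp1252").hex(), e_acute.encode("utf-8").hex(" "))

# The Windows-1252 byte on its own is not valid UTF-8.
try:
    b"d\xe9bacle".decode("utf-8")
    raw_byte_ok = True
except UnicodeDecodeError:
    raw_byte_ok = False

# Pure-ASCII output that is safe in both HTML and XML.
safe = "d\u00e9bacle".encode("ascii", "xmlcharrefreplace").decode("ascii")
print(raw_byte_ok, safe)
```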

I don't use &eacute; because it tends to break XML, and of course my ATOM feed is an XML document.

Darn. Doesn't have Find-and-Replace in Code view. Ah well, should be easier to copy the code into TextPad and back into BlogJet after extra editing. Couple of stray &nbsp;s too.

*crush* *kill* *destroy*

I definitely stand by an earlier post (at least, I think I've posted this before): Norton Internet Security must die!

Sadly I don't own this computer, or I would be ceremoniously scrubbing the areas of the disk which it occupied and burning the box.

The damn thing is a little child - it needs to be constantly patted on the head and reassured that it's doing the right thing. No, I don't want arbitrary eDonkey users trying to connect to this computer on port 4662 - not that I care, since I'm not running eDonkey, or anything else on that port.

If I say Block, I damn well mean it.

Friday, 20 February 2004

COM Marshalling on CE

Raymond Chen: Why do I get a QueryInterface(IID_IMarshal) and then nothing?

I had this problem with some CE ActiveX controls originally developed for Pocket PC, when running them on a custom platform (PDT7200, if memory serves). The reason? There are two versions of COM for Windows CE platforms, both of which ship with Platform Builder. The simple version supports in-process, MTA components only: on this version, you're only allowed to pass COINIT_MULTITHREADED to CoInitializeEx. This version doesn't support any marshalling; if your component needed it, that's your problem. You have to make your own way back to the UI thread if you need to fire an event.

The other version is known as DCOM in Platform Builder, and supports a pretty complete implementation of Distributed COM, allowing out-of-process components. This includes STAs and marshalling.

The Pocket PC uses the simple-COM implementation (at least, up to Pocket PC 2002 it does; I don't know about Windows Mobile 2003), so my (unmarked) component was fine. The PDT7200 uses the DCOM component and my component was unmarked, so it tried to look for a marshaller - and failed, because the type library wasn't registered.

Tuesday, 17 February 2004

Archaic computer hour

Dammit! BlogJet's Post button looks too much like the Link button

Rory had an 81...

I'm celebrating (I believe) my 20th year of programming this year: we got our first Spectrum in 1984, I think. Yes, he of the rubber keys and the blocky colour graphics.

Over the years we had two Spectrums (Spectra?), a couple of Interface 1s and some Microdrives. Actually, when we bought the second Spectrum second-hand, it was a freebie with the Interface 1 we actually wanted - my dad had written a planning applications database where all the data was stored on the Microdrive. The ROM on the Interface 1 had a bad habit of melting - I think we had about five over the years.

I must confess to killing the last surviving Spectrum by taking it apart, then breaking the membrane cable connecting the keyboard to the motherboard (motherboard? Only board!)

I pestered my dad for an Archimedes for a while (we had these at school) but, sensibly it turned out in the end, he bought an ICL DRS M30 PC (discounted, since he worked for ICL). This had - compared to the Spectrum - a shockingly fast 16MHz 286 processor, a massive 2MB of RAM and an enormous 100MB hard disk (partitioned into three 32MB partitions and a 4MB partition, because it came with MSDOS 3.3). This was in 1991, and I had all the fun of dealing with Windows 3.0 (yes, it was as bad as they say). This was the system I cut my teeth on (and the rest of the family too, considering the number of times I broke it) in the PC world.

This system got upgraded to 4MB, DOS 6.22 and Windows 3.11 over time, and got packed off to university with my sister, and we upgraded to a 486SX-33 with 8MB RAM and 420MB HDD. This served us all for two years before I bought my own PC (P120, 32MB RAM, 1.2GB HDD, Win95) for university.

After another couple of years I upgraded again to a PII-300, 128MB RAM, 8GB HDD passing the P120 down to my parents. Last year, three years after the PII, I built my own P4 2.8GHz, 512MB RAM, 120GB HDD. That's a system with processor clock 467x, memory size 10,922x, and storage capacity 1.2 million times my old Spectrum.

I never knew floating point was so complicated...

...but then my involvement with it is typically limited to randomisation values or APIs which are more convenient in floating point, but don't actually require that precision.

Visual C++ "Whidbey" (VC8) will have floating point optimisation pragmas.

Just a note to anyone dealing in currency values: don't use float. Instead, use a scaled integer. Some programming languages and environments offer built-in decimal or currency types (e.g. Visual Basic, SQL Server's money type, Ada).
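A short Python sketch of why, using amounts in pence as the scaled integer. The helper format_pence is my own and only handles non-negative amounts; Python's decimal module stands in for the built-in decimal types mentioned above.

```python
from decimal import Decimal

# Binary floating point can't represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2
print(float_total == 0.3)

# Scaled integer: keep everything in pence and format only for display.
pence_total = 10 + 20

def format_pence(pence):
    """Format a non-negative pence amount as pounds.pence."""
    return f"{pence // 100}.{pence % 100:02d}"

print(format_pence(pence_total))

# A decimal type does the same job without manual scaling.
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total == Decimal("0.30"))
```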

Tuesday, 10 February 2004

If my boss finds this...

...I'm out of a job.

Also, if he finds the number of posts on CodeProject (782 in 10 months = ~2.6 per day since I joined) I'm definitely fired. Not all of them were posted during work hours... just most of them.

My contract says I'm not allowed to reveal trade secrets. I've skated awfully close to that on occasions.

In mitigation, I can only say that I'm bloody bored at work. It wasn't supposed to go like this. I was supposed to come in in the morning, pick up the project I was working on, for which I have all the requirements and specifications, code and debug all day, then go home. I might even do a little (unpaid) overtime, if I'm really into what I'm working on.

In practice, what happens is that the work expands to fill the time available. Otherwise, you really would be sitting on your backside with nothing to do. Even when there's nothing to do, modifying the few products we have is practically not allowed. R&D? We've heard of it.

Programming Pearls

We appear to have a programming strategy at my employer which could be termed the Pearl Fallacy.

Basically, the principle is that if you shovel enough shit into the codebase, it will eventually turn into a pearl. This only works for oysters.

It's more likely to turn into a gyppo (as defined by Terry Pratchett in Mort, IIRC - this post alludes to the meaning) - a solid looking crust on the outside, but absolutely disgusting if you put your foot through it.

(If you think I'm getting frustrated with my employer, you would be right.)

Wanted: a blog tool with a diff engine

If someone updates a blog post, I'd like to be able to see what the author changed. I've done the best I can with the change I just made.

Saturday, 7 February 2004

On royalties

[Updated 2004-02-10: a reader corrected me on ownership of MPEG-4 AAC, which bores a huge hole in my argument. Whoops. Removed aspersions on Apple.]

Another thing people forget about standards is the issue of royalties. Again, two fairly recent issues: WMA and MPEG4 AAC, and the whole débacle surrounding Rambus' role in SDRAM.

Just because something is a standard doesn't mean it's free to implement. The developers of the standard will often want compensation for having developed it. This applies even to ISO-mandated international standards.

Taking an example, consider the DVD and digital broadcast. Fairly ubiquitous, at least in the West, right? How much do you think you need to pay to make a DVD video, or broadcast a digital programme?

Digital broadcast and DVD both use the MPEG 2 video standard. Royalties apply to decoder hardware, encoder hardware and to recorded media. They're paid into a pool (the MPEG LA company administers this) which then pays out to the various developers. MP3's full name is MPEG 1 Layer 3 audio - and royalties are due to Fraunhofer IIS, who developed it.

AAC was developed by Dolby (licensing info); it was then standardised first in MPEG-2 (for bit rates higher than 32kbps per channel) and later in MPEG-4. I previously wrote in this paragraph that MPEG-4 was basically Apple's QuickTime 3; other sources make it clear that MPEG-4 only uses QuickTime 3's file format, not the audio encoding.

In these terms, it's understandable that Microsoft uses its own WMA format for their own Media Player software, rather than AAC. Once you realise this, it's down to terms and prices as to what gets licensed.

On standards

The software community is currently on a big standards kick. If you've developed something, you try to get it standardised (example: Microsoft pushing the CLI through ECMA and ISO's fast-track process). You then criticise everyone else for 'not following the standard' or for 'extending the standard.'

I don't actually care much about standards. They're useful, yes, but I'll use a non-standard product if it's better. There are two standards for the SQL database query language, SQL-92 and SQL-99. Most database products now support a subset of SQL-92; newer products are targeting SQL-99 (Microsoft's next release of SQL Server will have some SQL-99 features).

Can you produce a useful database application using only SQL-92 features? Possibly. Can you produce a better application using your vendor's proprietary extensions? Almost certainly.

POSIX is the ISO standard for interfacing programs to an operating system, and for presenting programs to a user. Which major operating system has virtually no POSIX features, and yet has over 90% of the installed base? The one that makes sense. The one that has richer APIs. The extent of POSIX support when programming to Win32 is that some of the POSIX extensions to the C run-time library are available - with a leading underscore - in Microsoft's C run-time. Compare _open to CreateFile and you'll get just a flavour of how much more Windows offers to the developer.

As developers, we have to weigh up whether following a standard is beneficial to our users: either in terms of being able to replace our software (erk!) or interoperate with other software. The downside is that it may be difficult to follow the standard, or it may simply be overwhelmingly complex for the particular application.

Microsoft have been criticised recently (i.e. in the last six months) for both the new Office 2003 XML schemas for Word and Excel (a lot of complaints emanated from Sun^H^H^HOpenOffice because Microsoft didn't use their schema) and for the WinFS schema language - why didn't they just use XML Schema?

In the first case, features of Word and Excel simply didn't map onto the OpenOffice schemas in any reasonable form. And in the second, XML Schema is simply too complicated, and doesn't match the WinFS object model. With this much mismatch, the programs would probably contain more code and run slower than they could have, leading to unhappy users. In the former case, OpenOffice's schema is probably a good match for their internal object model - but it doesn't benefit Microsoft.

Take the flip side: when Microsoft were looking for a better authentication technology for Windows 2000 Active Directory domains, they could have invented their own. Instead, they went with Kerberos. Why? Because it had the features they were looking for (a distributed authentication system, without needing to centralise authentication to a single group of servers) and was already known and trusted. However, it didn't have the ability for the administrator to change a user's password remotely. So Microsoft added that feature to their implementation of Kerberos.

Sun, as always, cried foul! Our Kerberos implementation (or, rather, MIT's) doesn't do that. MS must be making their client work best with their server - let's complain to (in this case) the EU.

But who benefits? The users do. The administrator doesn't have to open a remote command shell on the login server to modify the password file directly. Computers can add themselves to the domain, generating their own passwords (in Windows domains, computers have accounts as well as users) without requiring the administrator to explicitly set one on the login server, then enter it correctly on the computer.

It's helpful to control both ends of the connection, because you can make more extensive changes without reference to others. But I don't believe that Microsoft are deliberately trying to disadvantage their competition - they're just looking for ways to improve the user's experience through technology.

On XML 1.1

Dare blogs about XML 1.1.

Reading the summary Dare links to reminds me of C99: a solution looking for a problem.

There isn't anything majorly wrong with XML 1.0, and the things that are wrong with it aren't fixed in 1.1 anyway.

I don't mind standardisation committees working on producing standard versions of what had been slightly-incompatible variants of a technology. However, a standards committee deciding on extensions to an existing standard has a real tendency to either go over the top (C99) or tinker with things that don't need fixing (XML 1.1). I hope C++ 0x doesn't turn into a nightmare, but the signs at present are that the library will get a lot of extensions, the language will get a few minor ones, and (as long as you haven't used any new keywords) your C++98 programs will upgrade directly.

Indeed a standards committee can end up operating in a rarefied academic atmosphere, inventing something that's only of very vague relevance to commercial software development.

Of course the C++ standardisation committee itself ended up inventing things like export, which actually turns out to be fairly useless. It might slightly reduce the time taken to compile a large C++ program with a lot of templates, but Visual C++'s precompiled headers (apparently something similar is also supported by GCC) already exist and are likely to be more beneficial. The goal of shipping a binary version of a template is scuppered by the fact that C++ allows so much variation between instantiations of a template - it can't simply be implemented by 'dope lists' as Ada generics can. The CLR gets away with it by instantiating at runtime, but this works against the general philosophy of C++.

About the only instance of standards upgrade done right I can think of is Ada 95, but even that broke a number of Ada 83 programs, and introduced a very strange object-based syntax for allegedly object-oriented programming.

Thursday, 5 February 2004

Correcting injustices

I need to correct a bit of an injustice I did to the CLR a couple of weeks ago. Since then I've bought and read Shared Source CLI Essentials, which covers a large proportion of the CLR/CLI codebase.

It turns out that JITted code (i.e. everything that is emitted as IL) actually uses dynamically-generated exception tables with a single exception handler around all JITted code. Any calls to unmanaged code also get an exception handler so that an unmanaged exception can be converted to a managed one. This reduces the number of user-to-kernel-to-user transitions that occur.

I did try to work out how the exception handling works, but disassembling the free build of ntoskrnl.exe is an exercise in frustration, especially when you don't have symbols (my main development machine at home is not networked, but it does have some patches after SP1 installed, so my SP1 kernel symbols don't match). Maybe the checked build would be better...

I had the thought that maybe you could hook the exception handling scheme with a driver, which would perform the whole unwind in user mode, but you'd still take the initial hit of a kernel transition on a throw.

The best conclusion is to realise that exceptions are for exceptional circumstances, where we don't care that it takes a little longer to change the point of execution. Microsoft pull less-used blocks of code (such as error handlers) out of the main path in the system DLLs, which can make it harder to follow; these cold blocks are placed in a different part of the DLL to reduce the working set of the normal execution path. RaiseException in kernel32.dll has two displaced cold blocks, IIRC - one for the case where you pass a NULL lpArguments, and another where you try to pass more than EXCEPTION_MAXIMUM_PARAMETERS in nNumberOfArguments.

Microsoft Offers 64-bit Windows XP Preview

"For the first time, Microsoft has posted a free public preview of a 64-bit version of one of its operating systems, opening up the AMD-based Windows XP 64-bit Edition for 64-bit Extended Systems for one and all. While the company has actually been testing the software since last fall--a previous beta was distributed at October's Professional Developers Conference (PDC) in Los Angeles --this is the first time the company has broadly released this type of preview software." -- Paul Thurrott's WinInfo

"MICROSOFT HAS MADE a trial version of its hotly-anticipated Windows XP for 64-bit AMD chips available for download.

"The trial OS is designed, says the Vole, for systems sporting Athlon64 or Opteron processors. Don't bother trying it on an Itanium, it advises." -- The Inquirer

People have short memories. There was a Windows Advanced Server Limited Edition (basically Windows 2000 SP2) for Itanium, essentially as a preview release, before Windows Server 2003 came out. Windows XP Professional 64-bit (for Itanium processors) is listed on MSDN Subscriber Downloads as released on 26th March 2003.

Most Itanium workstation systems will have been purchased with Windows preinstalled, if the user wants Windows. An Itanium cannot run a 32-bit x86 operating system; it must have a native OS (IIRC). By contrast, an AMD64 system can still run 16-bit code in legacy mode and can boot a 32-bit operating system, so there are currently AMD64 systems out there running the 32-bit versions of Windows.

Monday, 2 February 2004

IE Security Update

It's been a long time coming, but it's here.

I defer to JeffDav for the detailed information.

Apply now: windowsupdate.microsoft.com.

More BlogJet

Yep, definitely something worth having. That last post came out perfect (barring a little monkeying with Paste Special with the quote from Chris' blog).

On COM, pumping and the CLR

Chris Brumme's come up trumps again: Apartments and Pumping in the CLR.

This reminds me of something I was working on last week (and which I probably shouldn't be telling you about, but I'm beginning to get fed up anyway). We have a thin-client application server written in VB6. Under a fairly large amount of stress, we got Automation Error -2147417843: "An outgoing call cannot be made since the application is dispatching an input-synchronous call". What had happened, I think, was that we'd tried to call out to one of our 'transaction' objects from within VB's message filter, which it uses to ensure that things like painting and pop-up menus happen. The message filter is used when STA thread A has called an object on thread B (not in the same apartment) and is waiting for B to complete its call. A can't just block, because it needs to be able to handle any recursive calls from B to A.
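VB reports Automation errors as signed 32-bit decimals, which makes them awkward to look up. As a small sketch (decode_hresult is a made-up helper name, not part of any library), the number can be reinterpreted as an unsigned HRESULT and split into its fields. The facility mask mirrors the HRESULT_FACILITY macro in winerror.h.

```python
def decode_hresult(value):
    """Split a signed 32-bit Automation error number into HRESULT fields."""
    unsigned = value & 0xFFFFFFFF              # reinterpret as unsigned 32-bit
    return {
        "hex": f"0x{unsigned:08X}",
        "failure": bool(unsigned >> 31),       # top bit set means failure
        "facility": (unsigned >> 16) & 0x1FFF, # same mask as HRESULT_FACILITY
        "code": unsigned & 0xFFFF,             # low 16 bits
    }


fields = decode_hresult(-2147417843)
print(fields["hex"], fields["facility"], fields["code"])
```

For -2147417843 this gives 0x8001010D, facility 1 (FACILITY_RPC), code 0x010D, which winerror.h names RPC_E_CANTCALLOUT_ININPUTSYNCCALL.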

I tracked this down to excessive use of DoEvents in certain areas of code. As Chris says, "Deadlocks are easily debugged. Reentrancy is almost impossible to debug." Too right. The simplest approach, which also turned out to be the correct one, was to remove the calls to DoEvents.
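To show why DoEvents invites reentrancy, here is a toy model. It is only a sketch: MessagePump and the handler names are invented for illustration and bear no relation to VB's real message loop or to COM's message filter. A handler that pumps messages before it has finished lets a second handler run in the middle of it.

```python
from collections import deque


class MessagePump:
    """Toy single-threaded message queue (hypothetical, not VB's real pump)."""

    def __init__(self):
        self.queue = deque()
        self.depth = 0      # how many handlers are currently on the stack
        self.log = []

    def post(self, name, handler):
        self.queue.append((name, handler))

    def do_events(self):
        """Dispatch every pending message, as a DoEvents-style call would."""
        while self.queue:
            name, handler = self.queue.popleft()
            indent = "  " * self.depth
            self.log.append(f"{indent}enter {name}")
            self.depth += 1
            handler(self)
            self.depth -= 1
            self.log.append(f"{indent}leave {name}")


def simple_handler(pump):
    pass                    # does its work and returns


def yielding_handler(pump):
    pump.do_events()        # pumps messages before it has finished


def run(first_handler):
    pump = MessagePump()
    pump.post("A", first_handler)
    pump.post("B", simple_handler)
    pump.do_events()
    return pump.log


print("\n".join(run(yielding_handler)))
```

With yielding_handler, B is entered and left while A is still on the stack. Pass simple_handler to run instead and B only starts after A has left, which is the ordering the rest of the code was presumably written to expect.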

VB developers seem to be quite keen on doing the least possible thinking to resolve a problem. Code blocking? DoEvents. Need asynchronous behaviour? Use a Timer control. The Right Answer can often be to get out of the VB environment and write a multithreaded COM component to fire an event when it's finished. With this approach, though, this application server will soon have no VB code left.

Hey, this could be a good thing...

Anyway, this server has bigger problems - like consuming over 50% CPU on a 1.8GHz P4 when handling fewer than 2,000 transactions a minute. SQL Server was just laughing at us, typically idling along at less than 1% CPU (on a separate box). A bit of judicious transferral of operations (changing a chatty interface to a slightly more chunky one, with more work done outside the bottleneck server process) improved matters.
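The chatty-versus-chunky difference is easy to show in miniature. This is a sketch with invented names (CountingServer, set_field, set_fields and the sample transaction are all hypothetical, not our real interface). Each method call stands in for one cross-process round trip into the bottleneck server.

```python
class CountingServer:
    """Stand-in for the bottleneck process; counts calls made into it."""

    def __init__(self):
        self.round_trips = 0
        self.fields = {}

    def set_field(self, name, value):
        # Chatty style: one round trip per field.
        self.round_trips += 1
        self.fields[name] = value

    def set_fields(self, values):
        # Chunky style: one round trip for the whole batch.
        self.round_trips += 1
        self.fields.update(values)


transaction = {"account": "12345", "amount": 10.0, "currency": "GBP"}

chatty = CountingServer()
for name, value in transaction.items():
    chatty.set_field(name, value)

chunky = CountingServer()
chunky.set_fields(transaction)

print(chatty.round_trips, chunky.round_trips)  # 3 1
```

Both servers end up holding the same data. The only difference is how many times the caller had to cross into the server process to get it there.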

.NET Interop is becoming, er, entertaining, too. The application server does nothing useful on its own - it must host an application, which is a COM object (connected server-to-application using Automation late-binding - simple, but inefficient). COM Interop in the CLR allows us to write applications using VB.NET (thank the gods, an environment that doesn't utterly suck). I'm not sure of myself in this environment yet, though. Should I ever call Marshal.ReleaseComObject? Should I be calling it every time?

Publisher Policy bit me as well. It's not well documented. I should write an article about it (here or on CodeProject). If you don't know what it is, forget about it ;)

For a while now I've been vaguely considering re-implementing this server in a .NET language. Might do a bit of that.