Sunday, 18 July 2004

Academic biography and C++ parsing

Of course, due to my peculiar route through university, I ended up studying both the C and C++ courses twice, which tended to reinforce the principles - then for my final year project, I got very familiar with the C++ syntax and semantics. My friends already know my background, but I've had a few comments from people who don't, so here we go:

I entered Aston in September 1996 to study for an M.Eng in Electronic Systems Engineering - a four year full-time course. For any foreigners, a normal Bachelor's first degree in the UK has three years of full-time study, often optionally with an additional year spent on industrial placement. Normally, a Master's degree is postgraduate study, but M.Eng is atypical - it's a four-year first degree. When I started, you were supposed to get industrial placement work during summer vacations, but this rule was relaxed when it became clear that there weren't any placements available.

I attended that course for three years, scraping passes after resitting exams in both the first and second years (in one resit exam, my grade shot up from a failing grade to over 80%). In the third year a combination of factors (illness, extensive involvement in Entertainments, increasing complexity of mathematics) led to my deferring the exams. Over that summer I questioned my future direction and decided to apply for a transfer to Computing Science, which was granted. The departments accepted that my studies so far had covered all the material necessary to proceed directly to year 2. I completed year 2 with decent grades, then completed the final year fairly well, graduating with Lower Second-Class Honours (a 2:2, or Desmond).

The main thing that dragged my final Honours grade down was my final-year project. I'd decided to tackle a program to diagram the static structure of a program, with C++ as the source language. I got totally bogged down in the syntax and semantics of C++, and basically failed to complete the project. I don't do things by halves - I was trying to persuade a variety of parser generator tools to interpret standard C++ correctly. I suppose that I was taken in by my Programming Language Implementation lecturer, who claimed that parsing was the bit of CS that could be called 'science', because we know how to do it.

Not for C++, we don't. The commercial C++ parsers are either hand-written recursive-descent parsers which don't really follow any 'scientific' method (Visual C++), or based on yacc/bison with so much feedback between parser and lexer stages, and with so many bizarre actions in the middle of rules, that you can't call them LALR(1) any more (GCC).

Yes, I could have adapted GCC to handle the parsing. But quite frankly I didn't want to spend three months or more understanding how to hook into it.

The problem with C++ is just that it's hugely ambiguous and overloaded, with the distinguishing symbols - if any - an indeterminate distance from the beginning of a clause. This makes it great to write, IMO, and it's OK for humans to read given that we can see the whole line, but very hard to write a parser for. The syntax actually allows a great deal more expression than most programmers ever use, but the parser really has to accept all of it.

C++ really requires a stronger parsing method: LR(k). I now recommend that people look at Elkhound if they want to parse C++ and don't want to use an available parser.

What I should have done was to use Visual C++'s CL and BSCMAKE tools to generate the source browser files and then use the Browser Toolkit to interpret the BSC file, and generate diagrams from that. But I don't think I knew it existed at the time.


Stuart Dootson said...

Your best bet would have been to parse something easier, like Ada, for example, instead. I've written an Ada 83 parser in a Lex/Yacc-a-like parser generator, but wouldn't even consider C++ - I'd use something like the Program Database Toolkit for that - it's based on the Edison C++ parser, so is going to be pretty standards compliant.

Mike Dimmick said...

Yeah, I know. Towards the end of the schedule I was seriously concerned and started looking at Java (which is a lot easier to parse), but I really left it too late.