Monday, August 17, 2009

A guide to pork, part 4

The last portion of the guide started covering declarations. This week, I will be covering a lot more about declarations. In particular, types and names are covered in a lot more detail. I had intended to talk about classes in more detail as well, but the post was getting long enough as it was, so I'll save discussion for a fifth part.

What has been covered so far:
Step 1: Building and running your tool
Step 1.1: Running the patcher
Step 2: Using the patcher
Step 3: The structure of the Elsa AST
Step 3.1: Declarations and other top-level-esque fun
Step 3.1.3: Function
Step 3.1.6: TypeSpecifier
Step 3.1.11: Declarator
Step 3.1.12: IDeclarator
Step 3.4: The AST objects that aren't classes
Step 3.4.5: DeclFlags

Aside 1: An introduction to porky (continued)

It seems that Chris Jones finally blogged about porky. If you're interested, go read about it.

Aside 2: Pork Web

In the course of writing this guide, I got the idea of writing a tool to display the Elsa AST nodes without having to constantly fidget around with dumpAST. The result is Pork Web, which is also a good demonstration of how much a little CSS will get you.

Step 3.1.6: TypeSpecifier (continued)

In the last article, I mentioned TypeSpecifier but elided the details of its subclasses, which hold the interesting information, because I had misunderstood some key points.

For projects that are sufficiently large to be considered good candidates for automated rewriting, chances are that basic types like int are going to be rather rare, in favor of typedefs that give more precise storage sizes (such as mozilla's PRInt32). The parsing of the AST in Elsa and pork happens at a different stage from the type verification, which means that typedefs have an impact on the structure of nodes. That is not to say that you can't get type information; it just means that you want to use Elsa's type information (embodied in Variable) for more accuracy here. Naturally, #define has no impact on type information, because we are dealing with preprocessed files.

Which of the subclasses of TypeSpecifier is used depends on how the type is written. If you are using a standard type keyword like int, you get the TS_simple flavor, which I covered last week. Structures parsed as classes in C++ (i.e., class, struct, and union) are all TS_classSpec nodes; enums form TS_enumSpec nodes. Class nodes, if you do not provide an actual definition, are classified as TS_elaborated nodes. If all you have is a simple name, then the node is a TS_name node, regardless of whether that type is a class, enum, or other such type. Names will never be null; for anonymous constructs like enum {a} x;, a unique string beginning with __ will be used instead.
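As a rough illustration of the rules above, here are a handful of declarations annotated with the TypeSpecifier flavor each one should produce according to this section. The names (MyInt32, Widget, Gadget, Color) are made up for the example, and the annotations follow the text rather than a fresh run of dumpAST.

```cpp
typedef int MyInt32;               // hypothetical typedef, in the spirit of PRInt32

int plain;                         // TS_simple: a built-in type keyword
class Widget { public: int w; };   // TS_classSpec: class with a definition
enum Color { RED, GREEN };         // TS_enumSpec: enum with a definition
class Gadget;                      // TS_elaborated: keyword and name, no definition
Gadget *gadget;                    // TS_name: a bare name that happens to be a class
Color color;                       // TS_name: a bare name that happens to be an enum
MyInt32 sized;                     // TS_name: a typedef name, not TS_simple
enum { ONLY } anon;                // TS_enumSpec whose name is a generated __ string
```

The MyInt32 line is the one to watch in a large codebase: the underlying type is int, but the node is a TS_name, so a search for TS_simple alone would miss it.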

TS_name has two variables: a PQName *name, and a bool typenameUsed. Both of these are self-explanatory. For the curious, the latter is set when the name is prefixed with the typename keyword, such as in the below:
template<class T> class Y { typename T::A a; };

TS_elaborated again has two variables: the same PQName *name variable, as well as a TypeIntr keyword variable. The keyword variable records which keyword was used in the elaboration.

TS_enumSpec has again two variables, this time a StringRef /*(const char *)*/ name, as well as a FakeList<Enumerator> elts, which contains the elements in the enumeration.

TS_classSpec is the most complex of the subclasses, as it represents the definition of a class. It contains the same name and keyword variables as TS_elaborated, but it also has the base classes in the form of a FakeList<BaseClassSpec> *bases and its members in a MemberList *members.

Step 3.4.9: SimpleTypeId (Primitives, if you come from Java)

The SimpleTypeId enum represents the primitive types defined by C++, namely char, bool, int, long, long long, short, wchar_t, float, double, and void, as well as their unsigned and signed counterparts (if they exist). The name of each of these follows the general scheme ST_UNSIGNED_INT, although short and long are ST_SHORT_INT and ST_LONG_INT, respectively (but not long long!).

That's not all, though. For simplicity, some places have fake type codes. The most common of these will be ST_ELLIPSIS, the varargs portion of functions; there is also ST_CDTOR, the return type for constructors and destructors. The source code also mentions GNU or C99 support for complex numbers, but I have not found the magic needed to get those to work.
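To make the naming concrete, the sketch below annotates a few ordinary declarations with the SimpleTypeId codes named in this section. The variable, function, and struct names are invented for the example, and the comments restate the text above rather than output captured from Elsa.

```cpp
#include <cstdarg>

unsigned int counter = 0;     // ST_UNSIGNED_INT, the general naming scheme
short shortVal = 0;           // ST_SHORT_INT
long longVal = 0;             // ST_LONG_INT
long long longLongVal = 0;    // long long does not get the _INT suffix

// Sums n int arguments. The trailing "..." is what shows up as ST_ELLIPSIS.
int sumInts(int n, ...) {
  va_list args;
  va_start(args, n);
  int total = 0;
  for (int i = 0; i < n; ++i) {
    total += va_arg(args, int);
  }
  va_end(args);
  return total;
}

struct Tracker {
  Tracker() : alive(true) {}  // no written return type: ST_CDTOR
  ~Tracker() {}               // likewise for the destructor
  bool alive;
};
```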

Step 3.4.4: CVFlags (CV-qualified IDs)

Whenever something can be const or volatile, there is a CVFlags enum. It can either be CV_NONE (no qualifiers), CV_CONST, CV_VOLATILE, or both of the latter. There also exists a method sm::string toString(CVFlags cv) that will print a string representation of such a variable. Need I say more?
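For readers who have not met flag enums before, here is a minimal stand-in showing how such a type behaves. Everything in the sketch namespace is hypothetical: the numeric values are arbitrary, std::string stands in for sm::string, and the exact text that Elsa's own toString produces may differ.

```cpp
#include <string>

namespace sketch {

// Stand-in flag values; Elsa's real constants are not assumed here.
enum CVFlags { CV_NONE = 0, CV_CONST = 1, CV_VOLATILE = 2 };

// Combine two flag sets, keeping the enum type.
inline CVFlags operator|(CVFlags a, CVFlags b) {
  return static_cast<CVFlags>(static_cast<int>(a) | static_cast<int>(b));
}

// Render the set qualifiers as space-separated keywords.
inline std::string toString(CVFlags cv) {
  std::string out;
  if (cv & CV_CONST) {
    out += "const";
  }
  if (cv & CV_VOLATILE) {
    if (!out.empty()) {
      out += ' ';
    }
    out += "volatile";
  }
  return out;
}

}  // namespace sketch
```

Testing a single qualifier is then a bitwise AND against the flag of interest, which is how "both of the latter" fits into one value.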

Step 3.1.5: Declaration (The outer part of declarations)

Any time a variable is declared, one of the wrappers is Declaration (which may itself be found in various places). This has just three members, a DeclFlags dflags variable that represents the flags on the declaration, a TypeSpecifier *spec that is the type of the declaration, and the FakeList<Declarator> *decllist that contains the rest of the declaration. All of these have been covered in more detail earlier.
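A single source line can make the three-way split easier to picture. The declaration below is an invented example, and the comments map its pieces onto the three members described above.

```cpp
// One Declaration node, split as follows:
//   dflags   -> the static keyword
//   spec     -> int (a TS_simple TypeSpecifier)
//   decllist -> two Declarators: "total = 0" and "*cursor = &total"
static int total = 0, *cursor = &total;
```

Note that the pointer star belongs to the second declarator rather than to the type specifier, which is why both variables share one spec.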

Step 3.4.11: TypeIntr (Differentiation between classes and structures)

TypeIntr is a little enum that has four members: TI_STRUCT, TI_CLASS, TI_UNION, and TI_ENUM. The descriptions of them are, I think, straightforward. There is also a top-level method to convert to a string representation, char const *toString(TypeIntr tr), which will do what you think it does.

Step 3.1.8: Enumerator (The members of enumerations)

Within the definition of an enum is a FakeList of Enumerator nodes. These have a standard location and a StringRef name. The value as written in the source is held in the potentially null Expression *expr variable, while the actual value is in int enumValue.
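A small enum shows the difference between the two value fields. The enum itself is invented, and the comments assume that expr is null exactly when the enumerator has no written initializer, which is my reading of "potentially null" rather than something checked against the Elsa source.

```cpp
enum Level {
  LOW,        // no initializer: expr assumed null, enumValue 0
  MID = 10,   // initializer present: expr holds the literal, enumValue 10
  HIGH        // no initializer: expr assumed null, enumValue 11
};
```

If all you need is the number, enumValue saves you from evaluating the expression yourself.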

Step 3.4.8: PQName (Everybody's name)

In declarations and other places, in lieu of a string representing the name, you have the AST node PQName. The name stands for "possibly qualified." It exists because a name can have several components, and you often need to get at them individually.

This class has four subclasses: PQ_qualifier, PQ_name, PQ_operator, and PQ_template. In addition to these, it has a location member and a plethora of functions intended to help you with the task of printing these names, as well as overloaded operators to aid in output (to std::ostream and stringBuilder). They are:

SourceLoc loc
bool hasQualifiers()
sm::string qualifierString()
sm::string toString()
sm::string toString_noTemplArgs()
StringRef /* const char * */ getName()
sm::string toComponentString()
PQName *getUnqualifiedName() /* (And a const version) */
bool templateUsed()

PQ_qualifier represents a namespace or similar component to a qualified name. This is handled in a right-associative manner, such that std::tr1::shared_ptr would be the qualifier std, which qualifies tr1, which qualifies shared_ptr. This class has three variables: StringRef qualifier (the name to the left of the double-colon), TemplateArgument *templArgs (which represents the template arguments for templated class qualifiers), and PQName *rest (the right of the double-colon).

PQ_operator represents a PQName that is actually an operator overload. It has just two variables: OperatorName *o, the operator in question, and StringRef fakeName, a string representation of the operator. The latter is essentially a space-less name of the function (except that operator new and friends are represented as such, as well as conversion operators having the poor name of conversion-operator).

PQ_template represents a templated argument name. It again has two variables: the StringRef name of the base type and the TemplateArgument *templArgs that contains the arguments to the templatization. Note that if you are getting a member of a templated class, the name tree will have the PQ_qualifier node instead.

PQ_name is the last subclass of PQName (note the minor spelling difference). This has a single variable, StringRef name, which is the name. This is by far the most common name node, since everything that is not an instantiated template or an operator name will have this in the name somewhere.

For standard names, all of the various string output methods save qualifierString (which returns the empty string) will return the same thing, the name variable from PQ_name or the fakeName from PQ_operator. The differences arise when you have templates or qualified names.

If you have a qualified name, the methods change rather predictably. qualifierString returns the entire qualification string before the tail node (e.g., std::auto_ptr<T> becomes std::). The toString method and toString_noTemplArgs return the fully qualified names (optionally without template instantiation). toComponentString becomes identical to the qualifier variable. getName will be essentially identical to getUnqualifiedName()->getName(): it returns the name of the rightmost component.

Templated names also modify stuff predictably. toString_noTemplArgs and getName return the base name, without template arguments; toString and toComponentString return the name with template arguments.

The interesting stuff happens when you have a templated qualifier (e.g., std::set<int>::iterator). In that case, the toString_noTemplArgs will not strip the template args from the qualifier.
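The behavior described in the last few paragraphs can be modeled with a toy stand-in. This is not Elsa's PQName: the Name struct and the free functions in the sketch namespace are hypothetical, std::string replaces StringRef and sm::string, and the template arguments are kept as plain text. The sketch only mirrors what the prose says each method returns.

```cpp
#include <string>

namespace sketch {

// One component of a possibly qualified name, nested to the right.
struct Name {
  std::string text;       // qualifier text, or the final name
  std::string templArgs;  // e.g. "<int>"; empty when not templated
  const Name *rest;       // next component; null on the tail node
};

inline bool hasQualifiers(const Name &n) { return n.rest != 0; }

// Walk to the rightmost component.
inline const Name &getUnqualifiedName(const Name &n) {
  const Name *p = &n;
  while (p->rest != 0) {
    p = p->rest;
  }
  return *p;
}

// Base name of the tail, without template arguments.
inline std::string getName(const Name &n) {
  return getUnqualifiedName(n).text;
}

// Everything before the tail, template arguments included.
inline std::string qualifierString(const Name &n) {
  std::string out;
  for (const Name *p = &n; p->rest != 0; p = p->rest) {
    out += p->text + p->templArgs + "::";
  }
  return out;
}

inline std::string toString(const Name &n) {
  const Name &tail = getUnqualifiedName(n);
  return qualifierString(n) + tail.text + tail.templArgs;
}

// Strips template arguments from the tail only, not from qualifiers.
inline std::string toString_noTemplArgs(const Name &n) {
  return qualifierString(n) + getName(n);
}

}  // namespace sketch
```

Building std::set<int>::iterator as three nested nodes (std, then set with <int>, then iterator) gives a qualifierString of std::set<int>:: in this model, and toString_noTemplArgs leaves the <int> in place because it lives on a qualifier rather than on the tail.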

hasQualifiers() is identical to the isPQ_qualifier() method. templateUsed() is true if the qualifier or template used the template keyword. This is a feature that would be used for disambiguation purposes, such as this example (taken from the ISO C++ spec):


struct X {
  template<std::size_t> X* alloc();
  template<std::size_t> static X* adjust();
};
template<class T> void f(T* p) {
  // T* p1 = p->alloc<200>();  // syntax error: parsed as (p->alloc) < 200 > ()
  T* p2 = p->template alloc<200>();
}

The last method to talk about is getUnqualifiedName. This method simply returns the PQName at the end of the name.

With all of the methods discussed, the question you're probably asking is what the easiest way to get the name out of a PQName is. If you're trying to find a method or class name, getName is your safest option. If you need to know the type arguments as well (say you're looking for particular instantiations), getUnqualifiedName()->toString() is a better option.

If you're looking at class members, you can probably use toString_noTemplArgs successfully (when you're looking for a particular function), unless you're interested in the qualifierString() (when you're looking for any function of the class). For cases where the namespace information is necessary, you probably want to investigate the name with the parallel type APIs of Elsa, unless you want to maintain your own state for using declarations and other shenanigans that make the original type non-obvious.

Unfortunately, while I said earlier that I intended to cover classes in this part of the guide, it looks like I do not have the time this week to cover them as well. At this point, it seems likely that part 5 will cover classes and part 6 will cover templates and errata. I cannot say what part 7 will touch: it will either be more errata or a start on the expression and statement code.
