Saturday, August 22, 2009

A guide to pork, part 5

In the previous two sections of my guide, I discussed basics of functions and then went into the specifics of types and names, as represented by the Elsa AST nodes. In this section, I will cover classes and some more errata on declarations. As I have mentioned earlier, the subsections of this third step will be visited out of the order of their numbering.

What has been covered so far:
Step 1: Building and running your tool
Step 1.1: Running the patcher
Step 2: Using the patcher
Step 3: The structure of the Elsa AST
Step 3.1: Declarations and other top-level-esque fun
Step 3.1.3: Function
Step 3.1.5: Declaration
Step 3.1.6: TypeSpecifier
Step 3.1.8: Enumerator
Step 3.1.11: Declarator
Step 3.1.12: IDeclarator
Step 3.4: The AST objects that aren't classes
Step 3.4.4: CVFlags
Step 3.4.5: DeclFlags
Step 3.4.8: PQName
Step 3.4.9: SimpleTypeId
Step 3.4.11: TypeIntr

Step 3.1.9: MemberList (inside a class)

MemberList has only a single variable: ASTList<Member> list.

Step 3.1.10: Member (part of a class)

Member nodes represent a member of a class type. Like other nodes, this is mostly represented by its subclasses, MR_decl, MR_func, MR_access, MR_usingDecl, and MR_template. The node itself has two members, SourceLoc loc and SourceLoc endloc. End locations represent the location just after the semicolon or closing brace, such that the range of text matches [loc, endloc); if you recall, this is the same syntax that the patcher uses when working with ranges of location.

Each of the subclasses of Member only adds one variable, which is the type that member represents. MR_decl adds Declaration d, hence it is used for all declarations within the class, be it a variable declaration like int x, a type definition, or a function without a body. MR_func adds Function f, so it represents all functions that have their body in the class declaration (A(); is a declaration, but A() {} is a function). MR_usingDecl has as its member ND_usingDecl decl (which is covered under the section NamespaceDecl), and MR_template uses TemplateDeclaration d.

The final subclass of Member is MR_access, which has its member AccessKeyword k. This node represents all of the declarations like private:. Since the information keeping track of the access is stuffed in separate AST nodes, you may be wondering how to retrieve this information given only a specific member. The answer, naturally, lies in the auxiliary classes to the AST, something which I have avoided mentioning. Some nodes provide access to a Variable member, one of whose methods retrieves the access to the member. More information about this will be discussed when I talk about that class in detail.

A final thing to note is that Elsa will add some nodes into the AST by the time you use the visitor. These are the implicit methods dictated by the C++ standard. You can check if one of these members is implicit if the DeclFlags variable contains DF_IMPLICIT. Another flag that will also be set is the DF_MEMBER flag.

Step 3.1.7: BaseClassSpec (extending classes)

BaseClassSpec nodes represent a superclass for a class. It has three variables, bool isVirtual, AccessKeyword access, and PQName name, all of which are self-explanatory.

Step 3.4.1: AccessKeyword (controlling access)

AccessKeyword is an enumerated type with three important members. These are AK_PUBLIC, AK_PROTECTED, and AK_PRIVATE, all of which represent what you think they represent. There is also a AK_UNSPECIFIED member, but that should not be present by the time you get to the AST nodes. Naturally, there is also a const char *toString(AccessKeyword) method for converting these types into a string.

Step 3.4.7: OperatorName (operator overloading)

These nodes are informational nodes about operators. You should only find them within the PQ_operator type. The class OperatorName only has a single method const char *getOperatorName(). This is the basis for the PQ_operator name strings, so its results are as mentioned there.

The first subclass, ON_newDel, represents the memory operator overloads. It has two members, bool isNew and bool isArray. The first differentiates between operator new and operator delete, the latter differentiation between operator new and operator new[].

The second subclass, ON_operator, represents the standard operator overloads. It has only one member, OverloadableOp op, which represents the operator being overloaded. The names in the OverloadableOp enum all begin with OP_ and can be idiosyncratic. Example operators are OP_NOT, OP_BITNOT, OP_PLUSPLUS, OP_STAR, OP_AMPERSAND, OP_DIV, OP_LSHIFT, OP_ASSIGN, OP_MULTEQ, OP_GREATEREQ, OP_AND, OP_ARROW_STAR, OP_BRACKETS, and OP_PARENS. There are naturally more, but the other names should be derivable from this sample (pretty much all the idiosyncracies were added to the list); the full list is in cc_flags.h if you need to see it. There is also the standard toString(OverloadableOp) method if you are confused about a particular operator.

Note that some of the operators can be used in different ways. For example, OP_STAR both represents the multiplication operator and the pointer dereference operator. The way to differentiate between the two is via the number of arguments, although one must keep in mind that operators that are class members have one less argument. The postfix increment and decrement operators are differentiated from the prefix forms in that the postfix forms add a second int argument, which is incidentally always 0 if you don't explicitly call the function.

The final subclass is the type conversion operator, ON_conversion. This contains a single member, ASTTypeId type. The member type will have a terminal D_name in its declaration with a null name; the main purpose of the declaration under the ASTTypeId is to capture the pointers.

Step 3.4.2: ASTTypeId (a less powerful version of Declaration)

ASTTypeId is modelled after the Declaration node, but it's used in places where multiple declarations are not usable. Indeed, its most common usage is to represent a type (nominally TypeSpecifier) that can have pointers or references. It has two members, TypeSpecifier spec and Declarator decl, both of which act as their analogues in Declaration.

Step 3.1.4: MemberInit (simple constructors)

When working with constructors, the initialization of members is treated separately from the rest of the constructor. In Elsa, the nodes where this happens are the MemberInit nodes. These nodes contain a few members:

SourceLoc loc
SourceLoc endloc
PQName *name
FakeList<ArgExpression> *args

The source location and end locations have the standard meanings. The name attribute refers to the name of member being initialized. The arguments refer to the arguments of the function-like calls.

Step 3.1.14: Initializer (the last part of declarations)

The Initializer nodes represent various ways to initialize an object. The class itself has a single member, SourceLoc loc, but it has three subclasses, each representing some form of initialization.

IN_expr represents the standard forms of initialization people are used to seeing, something along the lines of int x = 3;. These nodes have a single member in addition, the Expression e member which represents the expression initializing the declaration.

IN_ctor represents the constructor-like initialization forms, such as int x(3);. These have a single member, FakeList<ArgExpression> *args, which represents the arguments within the parentheses.

IN_compound is the final form, which represents the array-like initialization for structs or arrays. For example, int x[1] = { 0 };. This has a single member as well, ASTList<Initializer> inits, which is a list of the initializers within the aggregate syntax. Some words of caution, though, is that aggregate initialization can have unexpected results: multidimensional arrays need not have nested braces, and, in C++0x (and gcc since a long time, though it gives you a warning), you can also omit the braces for nested structures. Bit-fields and statics are omitted from initializers, and, if you have less elements in the initializer than you need, the rest are "value-initialized" (i.e., the equivalent of 0). Elsa, unfortunately, does not aide you any further in deducing which element is actually initialized by any given initializer.

Step 3.1.13: ExceptionSpec (saying what you may throw)

ExceptionSpec nodes correspond to the throw declarations on function declarations. These nodes only contains one member, FakeList<ASTTypeId> *types, which represents the types that method is declared to be able to throw.

That is all I have for this part of the pork guide. Part 6 should finish up sections 3.1 and 3.4, so I look track to have part 7 start discussion statements and expressions. I will probably defer the auxilliary API until around part 9 or so, as I really need to play with it some more first.


Ameet said...

Hi Joshua, pardon my ignorance. I ve been trying to use pork for parsing c++. One of things I am stuck on it how to identify a #pragma in the AST? Question is: does pork put the #pragma as a node in the AST? how to read it? Is there any example? Any help is really appreciated. Thanks in advance

Joshua Cranmer said...

No, elsa does not parse #pragma (it is part of the preprocessor, which pork by and large does not know how to handle).

toastfr35 said...

Hi Joshua, thanks for this great introduction to pork. I'm now trying to use pork to extract a callgraph from some C++. I've got a visitor that traverses the AST but I haven't yet found a good way to 'connect' function calls to function declaration or definition. Especially in the case of template function and template classes. Do you have any advice? Thanks.