Home Contact Projects

SilvIR: Definition, Draft 12021-02-12

This post assumes you've read the previous one in the series.

As a terminology note:

Overview

We'll explain the outline of what evaluation of a program written in Silver and compiled to SilvIR looks like with an example program that implements a simple language translator tool. The actual grammar will be detailed below, but an overview of how these pieces will fit together might be useful context.

Suppose that this program starts, parses a file, checks it for errors, performs some optimizations, and writes out the optimized version.

When a SilvIR program starts, it invokes a main function. This main function probably pattern-matches on command-line arguments, then invokes the parser on the file specified by the user.

The workings of the parser are outside the scope of SilvIR; it gets called into via a foreign function call. The interface it uses to construct terms is backend-dependent, but it constructs them in the same way a SilvIR program would, and returns them to the SilvIR program.

Note that terms are pure data; one could imagine a datatype that represents arbitrary terms like:

data Term = MkTerm
  { name :: ProductionName
  , children :: [Term]
  }

The compiler then needs to check for errors. This requires providing the AST with an inherited attribute for the environment, so a decorate expression is used. The process of a tree being created corresponding to the term subject to the decorate expression is known as "tree-construction time." This process begins by allocating a tree with identical structure to the term, then traversing through it, running a series of functions ("production bodies") corresponding to the production used at the node to set values for attributes. This includes both setting inherited attributes (or thunks corresponding to them) on children and setting synthesized attributes (or thunks corresponding to them) on the node itself.

After this is done, the attribute value on the root node corresponding to the errors attribute is retrieved. It's a thunk, since synthesized attributes are lazy. (It's also usually defined as a collection attribute, but to keep this example simple, we'll pretend it wasn't, or that whole-program-optimization transformed this away.) Forcing the thunk requires retrieving the errors attribute from each of its children and forcing the resulting thunk. This downwards thunk-following occurs until a leaf node is hit.

If the leaf node's errors is (a thunk around) the empty list, this'll get bubbled up and appended to other lists of errors.

Let's say one of the errors we wish to detect is a variable being used without being defined/bound. The errors attribute therefore depends on an inherited attribute, env. In the simple (unoptimized) case, this results in retrieving the env attribute from the node itself. This will be a thunk set by the parent to refer to the parent's env (or something consed onto it, etc.), so fully forcing the environment would result in traveling up the tree as well.

Once all the thunks have been settled and we're back in the main function, we probably pattern-match on this list to check if it's empty. If it's not, we print it, which probably results in a bunch more thunk-forcing, etc.

Assuming there were no errors, we go on to getting an optimized version as a higher-order attribute. These evaluate the same as any other synthesized attributes, but instead of thunks around lists, they're thunks around terms.

We then want to convert this term to a string in order to write it out. This is usually defined as an attribute as well, so we again decorate it to construct a tree for it, and demand the synthesized attribute unparse. Once this comes back, we can write it out to a file.

Lifecycle of a Term

In the above, we can see a general pattern for how terms are created and used.

  1. A term is created, with the cons AST node. This might be reasonably called term-construction, but it's frankly not interesting enough to deserve a name, since terms are plain data.
  2. The term is decorated to create a tree, with the decorate AST node. This is known as "tree-construction." The essential purpose of this is to store thunks for each equation as attributes on the tree.
  3. An attribute is retrieved from the tree, with the getAttr AST node. This does not yet involve evaluating the attribute equations, just retrieving the thunks.
  4. The thunk is forced, with the force AST node. Thunks can contain arbitrary expressions, so this can construct terms or trees, access attributes, force other thunks, etc. As is usual for lazily evaluated languages, once a thunk has been forced, a value is written back, and attempting to force the same thunk again returns the same value, rather than recomputing it.
  5. A tree can be turned back into a term with the undecorate AST node. This corresponds to the new function in Silver, but has (in my opinion) a much more descriptive name.

This is approximately how things currently work in Silver; one can contrast this with systems such as JastAdd, which don't store thunks on the values themselves. It is possible for a SilvIR backend to perform that optimization, but the semantics are significantly cleaner if everything is defined in terms of thunks, since Silver is a lazy language.

The Grammar

As covered in the previous post, Silver has a bunch of features that require some trickiness in SilvIR to make implementable. In the rest of this post, I'll go over the grammar in pieces.

Note that I'm intending for the actual final grammar to use Administrative Normal Form as a way to make the evaluation order explicit, and make writing correct optimizations easier. However, this definition doesn't contain that, for simplicity's sake. Next post does, but I'll try and ease into it.

Literals

Literal ::= Int
         |  String

Literals are used in a few places in the grammar; they are a subset of the runtime values that exist. Runtime values also include functions (which are constructed with lam), thunks (which are constructed with thunk), records (which are constructed with makeRecord), terms (which are constructed with cons), and trees (which are constructed with decorate).

Patterns

PrimPattern ::= varPat(LocalName)
             |  litPat(Literal)
             |  anyPat

Pattern ::= recordPat(Map<String, PrimPattern>)
         |  treeOrTermPat(ProdName, List<PrimPattern>)
         |  primPat(PrimPattern)

Patterns in SilvIR are restricted to "simple patterns," which disallow nested patterns. SilvIR patterns also do not interact with forwarding in any way; there will be later discussion on this.

Note that recordPat implements a "subset match"; fields not present in the pattern are ignored. As an example, the following expression evaluates to 1:

case(makeRecord({ "a" = lit(1), "b" = lit(2) }),
  [ (recordPat({ "a" = varPat("x") }), local("x"))
  , (anyPat, lit(3))
  ])

Everything else in patterns should have fairly intuitive semantics.

Exprs and TopLevelItems, the FP-looking parts

Expr ::= local(LocalName)
      |  global(GlobalName)
      |  lit(Literal)
      |  let(LocalName, Expr, Expr)
      |  letrec(Map<LocalName, Expr>, Expr)
      |  lam(List<LocalName>, Expr)
      |  call(Expr, List<Expr>)
      |  error(Expr)
      |  thunk(Expr)
      |  force(Expr)
      |  case(Expr, List<Pair<Pattern, Expr>>)
      |  pureForeign(String, List<Expr>)
      |  impureForeign(String, List<Expr>)
      |  makeRecord(Map<String, Expr>)
      |  getRecordMember(String, Expr)
      |  ...

TopLevelItem ::= globalDecl(GlobalName, Expr)
              |  ...

Program ::= Set<TopLevelItem>

Most of what appears here should look "fairly normal" for a dynamically-typed, call-by-value, functional language with multi-argument functions and optional laziness. (Well, I guess I said "tastes like Scheme" in more words...)

SilvIR is a call-by-value language, but supports thunks as a built-in type in order to support Silver's demand-driven attribute evaluation strategy. Thunks are created with the thunk AST node, but also result from using global to access global variables, and local to access variables bound by a letrec. This is because laziness is also used to implement circular/recursive values, which one can create with both global bindings and letrec-created bindings.

Calls to foreign functions are split into "pure" and "impure" versions. Essentially, a call is considered pure if there is no non-UB way for the function to exhibit side effects, including non-termination. This gives the optimizer leeway to aggressively transform, duplicate, or eliminate calls to these functions. Examples of pure foreign functions are arithmetic operators on integers, string concatenation, and reflect. Examples of impure foreign functions are genInt and error (though, error has its own AST node to facilitate debugger integration).

At program startup, a global environment is established from the top-level items. The expression call(force(global("main"))) is then evaluated.

Exprs and TopLevelItems, the attribute-grammar-specific parts

The attribute-grammar-specific parts of SilvIR are the complicated part with potentially-controversial semantics, so this might be unclear. Comment if you think there's something that could/should be explained better!

Undecorated terms can be constructed with the cons expression. It takes the name of the production the value belongs to (sometimes referred to as the tag) and a list of arguments, as well as a flag describing whether the child is decorable. This flag (formerly, misleadingly, known as IsChildDeclaredDecorated) determines whether the child will be traversed at tree-construction time. This is effectively a single bit of type information; in the future, SilvIR may be changed to provide this information via prodDecl instead.

Most children of nonterminal type should be declared childIsDecorable. Examples of types for which this should be childIsntDecorable are Integer (since integers aren't nonterminal types), Decorated Foo (since references to other trees shouldn't be redecorated), and skolem variables (since these may be instantiated to one of the former). If the exact semantics of this are unclear, they should hopefully become more clear later, when the execution semantics are gone over.

IsChildDecorable ::= childIsDecorable
                  |  childIsntDecorable

Expr ::= cons(ProdName, List<Pair<IsChildDecorable, Expr>>)

getChild reads a child from a term or tree, while getAttr reads an attribute off of a tree. It's UB to read a child or attribute that doesn't exist.

Children and attributes are both simple data members of trees and terms. This can be implemented as a vector of children and a map from strings to values for attributes, though different implementations of SilvIR can choose a different representation for optimization purposes.

Expr ::= getChild(Nat, Expr)
Expr ::= getAttr(AttrName, Expr)

setAttr and combineAttr are used at tree-construction time to write attributes to a tree. setAttr(attr, tree, value, next) evaluates to the same value next evaluates to, after performing the side effect of setting the attribute attr on the tree tree to the value that results from evaluating value. This possibly replaces a previous value of the attribute.

combineAttr(attr, tree, value, func, next) is similar, and is used to implement collection attributes. If attr was already set on tree, it sets it to the result of evaluating call(func, getAttr(attr, tree), value). If attr was not already set on tree, it sets it to the result of evaluating value. This whole combineAttr expression then evalutes to the result of evaluating next.

Expr ::= setAttr(AttrName, Expr, Expr, Expr)
Expr ::= combineAttr(AttrName, Expr, Expr, Expr, Expr)

undecorate returns a copy of a tree as a term, stripping off all attributes on decorable children. This copy is deep in the decorable-term-structure of the tree, but shallow in the non-decorable-children.

Expr ::= undecorate(Expr)

All these pieces come together with decorate. decorate's first argument is a term to perform tree-construction on. First, a tree is allocated, mirroring the structure of the term, replacing decorable children with freshly allocated trees. Then, tree-construction proceeds to the initialization phase; for each node:

Priorities allow for default productions and the various automatic attributes to be implemented correctly, and negative priorities allow for the optimization where inherited attributes are passed down strictly, avoiding a traversal back up the tree when they're demanded.

Note that productions are declared, while nonterminals currently are only implicitly declared by appearing in prodDecl. An explicit nonterminal declaration might be added in the future if some metadata needs to be attached to nonterminals.

Expr ::= decorate(Expr, Expr)

Priority ::= Int

TopLevelItem ::= ...
              |  prodDecl(ProdName, NTName)
	      |  defaultProdBodyDecl(NTName, Priority, LocalName, Expr)
	      |  prodBodyDecl(ProdName, Priority, LocalName, Expr)

Remarks

There are a bunch of things that are possible to express with this grammar, but are "bad" for one reason or another. For example, consider call(lam(["x", "y"], var("x")), [lit(1)]), where the wrong number of arguments are applied to a function. These are termed "undefined behavior." The optimizer and backend are allowed to assume undefined behavior does not happen, in the same way a C compiler can (for C's notion of undefined behavior), and that the same "nasal demons clause" applies.

Many examples of undefined behavior are shown in the next post, but they include type errors, function call arity errors, invalid use of foreign functions, etc. Before formally defining every bit of undefined behavior, I think I'll want to have experience from implementing at least one AOT backend.

Annotations aren't covered here; I think they can be transformed to children?

In the next post, we'll look at the execution semantics of this IR in more depth, and look at a simple grammar translated to the IR.