Home Contact Projects

Nanopass-friendly Attribute Grammars (Sketch)2023-04-26

EDIT: This idea has been cleaned up and published as Nanopass Attribute Grammars!

This is a sketch of an idea I've been noodling on for a few months, but haven't had time to work on since I need to graduate.

(Also, the name needs work, it just happens to be nonconflicting and it's how I arrived at the idea.)

Core Idea

(I'm pretending there's a single sort in the grammar for now, we'll get back to that later.)

Productions are no longer associated with a single nonterminal. Instead, occurrences (and their flowtypes) are associated directly with productions.

Nonterminals are collections of productions. Importantly, a production can belong to multiple nonterminals. The attributes that occur on all productions in the nonterminal (with a flowtype that is satisfied, etc.) occur on the terminal as a whole.

Productions have access to an additional type they can use in their children, This. The This type is, in essence, the nonterminal the production is currently acting like a member of.

As a running example, let's consider a language with boolean literals, the unit literal, printing booleans, and if expressions (with and without an else), in which we want to desugar IfThen(c, t) to IfThenElse(c, t, UnitLit()):

prod BoolLit(b: Bool);
prod UnitLit();
prod Print(expr: This);
prod IfThen(cond: This, then_: This);
prod IfThenElse(cond: This, then_: This, else_: This);

nonterminal Expr0 { BoolLit, UnitLit, Print, IfThen, IfThenElse }
nonterminal Expr1 { BoolLit, UnitLit, Print,         IfThenElse }

In addition to the standard kinds of attributes (inherited, synthesized, collection, etc.) we have a transform attribute, which feels somewhat like a functor attribute or a forward equation. Unlike a forward equation, they can be aspected on:

transform attribute toExpr1: Expr1;

aspect IfThen(cond: This, then_: This) {
  this.toExpr1 = IfThenElse(cond, then_, UnitLit());
}

Inside the equation's body, This = Expr1. From the outside, e : Expr0 |- e.toExpr1 : Expr1; demanding the attribute implicitly does all the recursion involved.

Extending to multiple sorts

In reality, we probably want multiple sorts we can simplify at once, e.g. Expr and Type. We might want to have the idea of a language; a collection of nonterminals that a production may reference (generalizing This-types) and transform attributes operate on:

language L0 {
  nt Expr { BoolLit, UnitLit, Print, IfThen, IfThenElse }
  nt Type { Bool, PeekABooL, Unit }
}

language L1 {
  nt Expr { BoolLit, UnitLit, Print,         IfThenElse }
  nt Type { Bool,            Unit }
}

transform attribute toL1: L1;

prod IfThen(cond: This.Expr, then_: This.Expr) {
  this.toL1 = IfThenElse(cond, then_, UnitLit());
}

// OK, this is maybe not interesting but I didn't wanna complect L0...
prod PeekABooL() {
  this.toL1 = Bool();
}

We then have both:

Other things

Closed NTs

I haven't really been thinking about this idea in the context of a "real" extensible language, just that you might be rewriting away syntax sugar, lowering things, etc.

Lucas points out that this design would still be compatible with the idea of closed nonterminals, but closedness would be specific to the nonterminal, so you could transform from a closed nonterminal to a non-closed one. This would allow feature developers and analysis developers to extend the same language without the boilerplate of defining a separate AST type.

Pattern matching

Pattern matching now has the typical simple semantics and efficient translation that it would in a ML-family language; hooray! As far as I'm aware, all the normal properties of pattern matching on algebraic data types apply, since you're pretty much just doing that.

Connections

Nanopass

I came up with this to be able to write a compiler more conveniently in the nanopass style, where you have many IR types, most of which are expressed as a diff of the previous:

(define-language L0
  (terminals
    (variable (x))
    (primitive (pr))
    (datum (d))
    (constant (c)))
  (Expr (e body)
    x
    pr
    c
    'd
    (begin e* ... e)
    (if e0 e1)
    (if e0 e1 e2)
    (lambda (x* ...) body* ... body)
    (let ([x* e*] ...) body* ... body)
    (letrec ([x* e*] ...) body* ... body)
    (e0 e1 ...)))

(define-language L1
  (extends L0)
  (terminals
    (- (constant (c))))
  (Expr (e body)
    (- c
       (if e0 e1)
       (lambda (x* ...) body* ... body)
       (let ([x* e*] ...) body* ... body)
       (letrec ([x* e*] ...) body* ... body))
    (+ (lambda (x* ...) body)
       (let ([x* e*] ...) body)
       (letrec ([x* e*] ...) body))))

The question I was asking myself was "how do you have strong types in your metalanguage, without having L0_App and L1_App as separate types."

Scala

This idea to answer the above was inspired by the way one encodes algebraic data types in Scala. Rather than having algebraic data types as a primitive notion, Scala has the idea of a sealed trait -- in essence, an interface that can only be implemented by types in the same package. Since the set of types in the package is known statically, the language can support exhaustive pattern-matching.

This doesn't preclude the same class belonging to multiple traits, however.

I don't think you can implement all of this as a "nice" Scala library (i.e. not just writing a compiler to Scala), though. (I'm not a Scala user, however, so there might be a way to that I'm just unaware of; the This type feels like it needs special treatment, though.)

Rejected Ideas

Encoding in datatypes

I've failed to do this as a Haskell library a few times now, where each production becomes a datatype and you glue them together with :+:, like in "Data types a la carte."

Talking about what equations can be computed just gets really gross, I think at a minimum you'd want to know the flowtypes when you're generating the code.

Decoupling transforms from nonterminals

In principle, you could use structural typing for nonterminals, and allow anonymous nonterminals; then, you could type each of the transform attribute equations separately, and talk about their union:

transform attribute desugar occurs on IfThen;

prod IfThen(cond: This, then_: This) {
  this.desugar = IfThenElse(cond, then_, UnitLit());
}

transform attribute purelyFunctional occurs on Print;

prod Print(expr: This) {
  this.purelyFunctional = UnitLit();
}

type Expr0 := nonterminal { BoolLit, UnitLit, Print, IfThen, IfThenElse }
type Expr2 := nonterminal { BoolLit, UnitLit,                IfThenElse }

// Well, maybe this should be an attribute? Whatever, this is fakey syntax
// anyway.
def expr0To1: Expr0 -> Expr1 := transform Expr0 with { desugar <+ purelyFunctional };

The syntax transform Expr0 with { desugar <+ purelyFunctional } represents the idea of using the desugar transform attribute if it exists, and otherwise falling back to the purelyFunctional transformation. If no transform equation applies, an identity transform happens.

We can type the IfThen.desugar equation as having type nonterminal { IfThenElse, UnitLit }; it returns IfThenElse directly, and must contain UnitLit in order for UnitLit() to be a valid argument to IfThenElse.

If we look at the inferred type for desugar <+ purelyFunctional on each production in Expr0:

We can then take the union of the inferred types to get:

transform Expr0 with { desugar <+ purelyFunctional } :
  Expr0 -> nonterminal { BoolLit, UnitLit, IfThenElse } // aka Expr0 -> Expr1

(I think this is too simplistic to work if you can pattern-match on children in transform attribute equations; checking it however OCaml's polymorphic variants get checked is probably better?)

This might actually be a good idea to support extensible languages, but I haven't given it enough thought to state that conclusively. It certainly adds a bunch of complexity, so I'm not including it for now.

Transform equations manually transform children

Instead of writing:

prod IfThen(cond: This, then_: This) {
  this.toExpr1 = IfThenElse(cond, then_, UnitLit());
}

the programmer could be required to write:

prod IfThen(cond: This, then_: This) {
  this.toExpr1 = IfThenElse(cond.toExpr1, then_.toExpr1, UnitLit());
}

This is more explicit, which is arguably a virtue in its own right. However, it doesn't let you get rid of the This type (trying too hard got me into the "Encoding in datatypes" pitfall above), so it has a marginal effect on the complexity of the language.

Performing the recursive rewrite in the compiler / runtime also gives a convenient place to inject any sort of "copy over attributes we want to ensure we won't have to recompute" logic.