[Repost] Matt Might: Writing an interpreter, CESK-style
Original: http://matt.might.net/articles/cesk-machines/
Writing an interpreter, CESK-style
Matthias Felleisen's CESK machine[1] provides a simple yet powerful architecture for implementing interpreters (among many other benefits).
The CESK approach easily models languages with rich features like:
- mutation;
- recursion;
- exceptions;
- first-class continuations;
- garbage collection; and
- multi-threading.
The host language need not support any of these features.
The CESK machine is a state machine in which each state has four components: a (C)ontrol component, an (E)nvironment, a (S)tore and a (K)ontinuation. One might imagine these respectively as the instruction pointer, the local variables, the heap and the stack.
This article discusses how to build a CESK machine for A-Normalized lambda calculus (ANF), a high-level intermediate representation for functional programs.
A working interpreter is provided in Racket.
[1] See page 60 of Matthias Felleisen's dissertation for a definition.
Machine-based interpreters
The CESK machine is a machine-based interpreter.
(Most would actually call it a semantics, since it is formally defined.)
At a high level, a machine-based interpreter has four components:
- Prog -- the set of programs.
- Σ -- the set of machine states.
- inject:Prog→Σ -- a program to initial-state injection function.
- step:Σ⇀Σ -- a (partial) state to state transition function.
Given a program p∈Prog, the interpreter first injects it into an initial machine state ς0:
ς0 = inject(p)
The algorithm for running the interpreter is then simple:
ς := ς0
while ς is defined for step:
  ς := step(ς)
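The loop above can be sketched in Python as follows. This is only an illustration: encoding "step is undefined" as a `None` result is an assumption, and the countdown machine is a hypothetical toy, not the CESK machine.

```python
def run(inject, step, program):
    """Inject the program, then apply step until it is undefined."""
    state = inject(program)
    while True:
        successor = step(state)
        if successor is None:   # assumption: None encodes "step is undefined here"
            return state        # the last state reached is the final state
        state = successor


# Toy machine (hypothetical): the program is a number, and each step counts down.
def countdown_inject(n):
    return n


def countdown_step(n):
    return n - 1 if n > 0 else None


final_state = run(countdown_inject, countdown_step, 3)
print(final_state)
```

Keeping `inject` and `step` as parameters means the same driver works for any machine that fits the four-component description.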
A note on determinism
The structure of the step function assumes deterministic evaluation. Nondeterministic evaluation requires a function that yields multiple potential successor states: step:Σ→P(Σ).
C, E, S and K
The CESK machine takes its name from the four components that comprise a state of its execution: the control string, the environment, the store and the continuation.
C
Depending on the language being interpreted, the control string could be as simple as a program counter, a statement or an expression.
In this article, the control string is an expression.
E
The environment is a structure, almost always a map, that associates variables with an address in the store.
The environment can be implemented as a purely functional map (hash- or tree-based) or even directly as a first-class function.
S
The store, which some might liken to a heap or memory, is a map from addresses to values.
Like the environment, the store can be a map (hash- or tree-based) or a first-class function.
K
The continuation component is a representation of the program stack, often represented directly as a list of frames, or as an implicitly linked list.
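One possible way to hold the four components together, sketched in Python. The field names, the use of plain dictionaries for the environment and store, and the string `"halt"` as a placeholder continuation are all assumptions made for this illustration.

```python
from typing import Any, NamedTuple


class State(NamedTuple):
    control: Any  # C: the expression being evaluated
    env: dict     # E: variable name -> address
    store: dict   # S: address -> value
    kont: Any     # K: what to do with the result


def inject(program):
    """Initial state: the whole program, nothing bound, only halt waiting."""
    return State(control=program, env={}, store={}, kont="halt")


# Expressions are encoded as nested tuples, so (f x) becomes ("f", "x").
initial = inject(("f", "x"))
print(initial)
```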
A-Normal Form
A-Normal Form is a normalized variant of the lambda-calculus.
Transforming a language to ANF is straightforward, and it simplifies the structure of an interpreter.
Here's a sample BNF grammar for a reasonable variant on ANF:
lam ::= (λ (var1 ... varN) exp)

aexp ::= lam
      |  var
      |  #t  |  #f
      |  integer
      |  (prim aexp1 ... aexpN)

cexp ::= (aexp0 aexp1 ... aexpN)
      |  (if aexp exp exp)
      |  (call/cc aexp)
      |  (set! var aexp)
      |  (letrec ((var1 aexp1) ... (varN aexpN)) exp)

exp ::= aexp
     |  cexp
     |  (let ((var exp)) exp)

prim ::= +  |  -  |  *  |  =
There are three kinds of expressions:
- Atomic expressions (aexp) are those for which evaluation must always terminate, never cause an error and never produce a side effect.
- Complex expressions (cexp) may not terminate, may produce an error and may have a side effect. However, a complex expression may defer execution to only one other complex expression. For instance, letrec defers directly to its body, and if defers to only one of its arms.
- Expressions (exp) can be atomic, complex or let-bound. A let-bound expression will first defer execution to the binding expression, and then resume execution in the body.
This structure forces order of evaluation to be specified syntactically.
For instance, the meaning of the expression ((f x) (g y)) is undefined until we know whether (f x) or (g y) is executed first. In ANF, this expression is illegal, and must be written:
(let ((fx (f x))) (let ((gy (g y))) (fx gy)))
or
(let ((gy (g y))) (let ((fx (f x))) (fx gy)))
so that there is no ambiguity.
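To make the three categories concrete, here is a small Python checker for the grammar. It is a sketch under assumptions: expressions are encoded as nested tuples, variables are strings, `#t`/`#f` are Python booleans, and a let is written `("let", var, exp, body)` rather than with the nested binding list.

```python
PRIMS = {"+", "-", "*", "="}


def is_aexp(e):
    """Atomic: variables, booleans, integers, lambdas and primitive applications."""
    if isinstance(e, (bool, int, str)):
        return True
    if isinstance(e, tuple) and e:
        if e[0] == "λ":
            return len(e) == 3 and is_exp(e[2])
        if isinstance(e[0], str) and e[0] in PRIMS:
            return all(is_aexp(a) for a in e[1:])
    return False


def is_cexp(e):
    """Complex: procedure call, if, call/cc, set! and letrec."""
    if not (isinstance(e, tuple) and e):
        return False
    head = e[0]
    if head == "if":
        return len(e) == 4 and is_aexp(e[1]) and is_exp(e[2]) and is_exp(e[3])
    if head == "call/cc":
        return len(e) == 2 and is_aexp(e[1])
    if head == "set!":
        return len(e) == 3 and isinstance(e[1], str) and is_aexp(e[2])
    if head == "letrec":
        return len(e) == 3 and all(is_aexp(a) for _, a in e[1]) and is_exp(e[2])
    if head in ("let", "λ"):
        return False
    return all(is_aexp(a) for a in e)  # procedure call: every part is atomic


def is_exp(e):
    """An expression is atomic, complex or let-bound."""
    if isinstance(e, tuple) and len(e) == 4 and e[0] == "let":
        return isinstance(e[1], str) and is_exp(e[2]) and is_exp(e[3])
    return is_aexp(e) or is_cexp(e)


nested = (("f", "x"), ("g", "y"))  # ((f x) (g y))
ordered = ("let", "fx", ("f", "x"),
           ("let", "gy", ("g", "y"), ("fx", "gy")))
print(is_exp(nested), is_exp(ordered))
```

The nested call is rejected because its operator and operand are themselves calls, which are complex rather than atomic.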
A formal definition
A formal definition of the CESK machine guides the code.
If you're unfamiliar with formal mathematical notation, you may want to review my article on the connection between discrete mathematics and code.
If you're only interested in running code, skip ahead.
The state-space, Σ, of the CESK machine for ANF has four components:
ς ∈ Σ = Exp × Env × Store × Kont
In case it's not clear, Exp is the set of all expressions defined by the earlier grammar. Also, the notation ς∈Σ is a hint that the symbol ς will be used to denote members of the set Σ.
Environments
The environment in a machine state is a partial function that maps a variable to its address:
ρ ∈ Env = Var ⇀ Addr
It has to be a partial function, because not all variables are in every scope.
Once again, the hint ρ∈Env indicates that the symbol ρ will be used to denote environments.
Stores
A store maps addresses to values:
σ ∈ Store = Addr ⇀ Value
In a CESK machine, variable look-up is a two-stage process: first to an address (through some environment), then to a value (through the store).
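A minimal Python sketch of the two-stage look-up, assuming dictionaries for both maps and integers for addresses (neither choice comes from the article):

```python
def lookup(var, env, store):
    """Two-stage variable look-up: variable -> address -> value."""
    address = env[var]       # stage 1: the environment yields an address
    return store[address]    # stage 2: the store yields the value


# Two environments may name the same address, so a change made in the
# store is visible through both of them.
store = {0: 10}
env_a = {"x": 0}
env_b = {"n": 0}
updated_store = {**store, 0: 11}
print(lookup("x", env_a, updated_store), lookup("n", env_b, updated_store))
```

The indirection is what lets mutation work: environments stay immutable while the store carries the changing values.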
Values
There are five kinds of values in this machine--void, booleans, integers, closures and first-class continuations:
val ∈ Value ::= void | z | #t | #f | clo(lam,ρ) | cont(κ)
In the set of values, z is an integer, while #t and #f are booleans.
A closure pairs a lambda term with an environment to define the values of its free variables. The environment is necessary because a term like (λ () x) is undefined, unless an environment specifies the value of x.
Continuations are included in values because the language includes call/cc, which enables the creation of first-class continuations.
Continuations
A continuation is effectively the program stack.
Creating a continuation allows us to divert to a complex sub-computation and return later.
So, a continuation needs enough information to resume execution.
For this machine, diverting to a sub-computation only happens in let-bound expressions.
Given a let-bound expression (let ([v exp]) body), execution will first go to exp, which means that when it finishes evaluating exp, the result will bind to v, and execution will resume with body.
In a CESK machine, the assumption is that the current computation is always executing on behalf of some continuation awaiting its result. (The special initial continuation, which awaits the result of the program, is called halt.)
Consequently, continuations nest within continuations.
Finally, every continuation must contain the local environment that knows the addresses of the variables in scope.
Putting this all together lets us formally define the space of continuations:
κ ∈ Kont ::= letk(v,body,ρ,κ) | halt
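As a data structure, the space of continuations might be sketched in Python like this. The class names `Halt` and `LetK` and the helper `depth` are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Halt:
    """The initial continuation: it awaits the result of the whole program."""


@dataclass(frozen=True)
class LetK:
    """A let frame: bind the result to var, run body in env, then return to kont."""
    var: str
    body: Any
    env: dict
    kont: Any


def depth(kont):
    """Count the let frames stacked on top of halt."""
    frames = 0
    while isinstance(kont, LetK):
        frames += 1
        kont = kont.kont
    return frames


# While (g y) is being evaluated inside
#   (let ([a (let ([b (g y)]) (h b))]) (f a))
# two lets are waiting: the inner one for b, then the outer one for a.
outer = LetK("a", ("f", "a"), {}, Halt())
inner = LetK("b", ("h", "b"), {}, outer)
print(depth(inner))
```

Because each frame points at the continuation beneath it, the structure is exactly the implicitly linked stack described earlier.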
Evaluating atomic expressions
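Atomic expressions (aexp in the grammar) are easy to evaluate with an auxiliary semantic function, A:AExp×Env×Store⇀Value:
Variables are looked up through the environment and then the store:
A(v,ρ,σ) = σ(ρ(v))
Integers evaluate to themselves:
A(z,ρ,σ) = z
Booleans do too:
A(#t,ρ,σ) = #t
A(#f,ρ,σ) = #f
Lambda terms become closures:
A(lam,ρ,σ) = clo(lam,ρ)
Primitive expressions are evaluated recursively:
A((prim aexp1 ... aexpN),ρ,σ) = O(prim)⟨A(aexp1,ρ,σ),...,A(aexpN,ρ,σ)⟩
where O:Prim→(Value∗⇀Value) maps a primitive operation (by name) to its corresponding operation.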
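A Python sketch of such an atomic evaluator follows. It rests on several assumptions: expressions are nested tuples, variables are strings, the table `PRIM_OPS` plays the role of O, and every primitive is treated as binary.

```python
import operator
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Closure:
    lam: Any    # the lambda term, e.g. ("λ", ("x",), body)
    env: dict   # the environment captured at creation time


# Stand-in for O: primitive name -> operation (assumed binary here).
PRIM_OPS = {"+": operator.add, "-": operator.sub,
            "*": operator.mul, "=": operator.eq}


def atomic_eval(aexp, env, store):
    """Evaluate an atomic expression; never steps the machine."""
    if isinstance(aexp, (bool, int)):
        return aexp                      # booleans and integers are themselves
    if isinstance(aexp, str):
        return store[env[aexp]]          # variable: address first, then value
    if aexp[0] == "λ":
        return Closure(aexp, env)        # lambda terms become closures
    op = PRIM_OPS[aexp[0]]               # primitive: evaluate operands recursively
    return op(*(atomic_eval(a, env, store) for a in aexp[1:]))


print(atomic_eval(("+", "x", ("*", 2, 3)), {"x": 0}, {0: 4}))
```

Since the function only reads the environment and store, it cannot produce a side effect, which matches the restriction placed on atomic expressions.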
Stepping forward
To define the step function, step:Σ⇀Σ, for this machine, we need a case for each expression type.
Procedure call
In a procedure call, the step function first evaluates the expression for the procedure to be invoked, and then the expressions for the arguments to be supplied.
Then it applies that procedure to those arguments.
Return
When the expression under evaluation is an atomic expression, it indicates that the current sub-computation is finished and we need to return the result to the current continuation, which has been patiently awaiting it:
step(aexp,ρ,σ,κ) = applykont(κ,A(aexp,ρ,σ),σ)
where the auxiliary function applykont:Kont×Value×Store⇀Σ (defined below) applies a continuation to a value.
Conditionals
Conditional evaluation is straightforward: the condition is evaluated, and the corresponding arm becomes the expression of the next state.
Let
Evaluating let will force the creation of a continuation.
Since execution first evaluates the bound expression, the continuation will contain enough information to resume execution in the body of the let.
step((let ([v exp]) body),ρ,σ,κ) = (exp,ρ,σ,κ′)
where κ′=letk(v,body,ρ,κ).
Mutation
The CESK approach makes mutation straightforward: look up the address to be changed, and then overwrite that address in the store.
Notation. Given a function (or partial function) f:X→Y, the function f[x↦y] is identical to f except that x yields y:
f[x↦y](x′) = y if x′ = x, and f(x′) otherwise.
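The update notation translates almost directly into code. A minimal Python sketch, assuming dictionaries as maps; the helper names `update` and `assign` are hypothetical:

```python
def update(f, x, y):
    """f[x ↦ y]: a copy of f that maps x to y and agrees with f elsewhere."""
    g = dict(f)
    g[x] = y
    return g


def assign(var, value, env, store):
    """Overwrite the address that var names; the environment is untouched."""
    return update(store, env[var], value)


env = {"x": 0}
old_store = {0: 1}
new_store = assign("x", 5, env, old_store)
print(old_store, new_store)
```

The old store is left intact, so earlier machine states remain valid, which is convenient when states are kept around for debugging or analysis.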
Recursion
Handling recursion requires establishing self-reference. In a language like Scheme, the construct letrec is often compiled into "lets and sets"; that is:
(letrec ([v1 exp1]
...
[vN expN])
body)
becomes:
(let ([v1 (void)]
...
[vN (void)])
(set! v1 exp1)
...
(set! vN expN)
body)
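A CESK machine can fake this by extending the environment first, and then evaluating the expressions in the context of the extended environment:
step((letrec ([v1 aexp1] ... [vN aexpN]) body),ρ,σ,κ) = (body,ρ′,σ′,κ)
where:
a1,...,aN are fresh addresses in σ
ρ′ = ρ[vi↦ai]
valuei = A(aexpi,ρ′,σ)
σ′ = σ[ai↦valuei]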
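The order of operations is the important part: extend the environment first, then evaluate. A Python sketch under assumptions (integer addresses, tuple-encoded expressions, and a stand-in evaluator `toy_atomic_eval` that only knows how to close over lambdas):

```python
def fresh_addresses(store, count):
    """Return count addresses outside dom(store); assumes integer addresses."""
    start = max(store, default=-1) + 1
    return list(range(start, start + count))


def step_letrec(bindings, body, env, store, kont, atomic_eval):
    names = [name for name, _ in bindings]
    addresses = fresh_addresses(store, len(names))
    # Extend the environment before evaluating anything ...
    new_env = {**env, **dict(zip(names, addresses))}
    # ... so that each closure captures an environment that already names it.
    values = [atomic_eval(aexp, new_env, store) for _, aexp in bindings]
    new_store = {**store, **dict(zip(addresses, values))}
    return (body, new_env, new_store, kont)


# Stand-in evaluator (hypothetical): lambdas close over env, other atoms pass through.
def toy_atomic_eval(aexp, env, store):
    if isinstance(aexp, tuple) and aexp[0] == "λ":
        return ("clo", aexp, env)
    return aexp


loop = ("λ", ("n",), ("loop", "n"))
body, env, store, kont = step_letrec(
    (("loop", loop),), ("loop", 0), {}, {}, "halt", toy_atomic_eval)
closure = store[env["loop"]]
print(closure[2]["loop"] == env["loop"])
```

The closure stored for `loop` carries an environment in which `loop` points back at the closure's own address, which is the self-reference recursion needs.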
First-class continuations
First-class continuations are a powerful construct, since they allow the simulation of so many other control constructs. For instance, exceptions are merely syntactic sugar on top of continuations.
And, continuations can do many other things too.
The procedure call/cc captures the current continuation as a first-class procedure:
step((call/cc aexp),ρ,σ,κ) = applyproc(A(aexp,ρ,σ),⟨cont(κ)⟩,σ,κ)
Applying procedures
The auxiliary function applyproc:Value×Value∗×Store×Kont⇀Σ applies a procedure to a sequence of argument values.
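A Python sketch of what applyproc might do. For a closure it allocates one fresh address per parameter and enters the body. The second case, where invoking a captured continuation hands the value to that continuation, is an assumption about how first-class continuations are applied; so are the integer addresses and the `applykont` parameter passed in to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Closure:
    lam: Any    # ("λ", params, body)
    env: dict


@dataclass(frozen=True)
class Cont:
    kont: Any   # a captured continuation


def applyproc(proc, args, store, kont, applykont=None):
    if isinstance(proc, Closure):
        _, params, body = proc.lam
        if len(params) != len(args):
            raise TypeError("arity mismatch")
        start = max(store, default=-1) + 1            # fresh integer addresses
        addresses = range(start, start + len(params))
        new_env = {**proc.env, **dict(zip(params, addresses))}
        new_store = {**store, **dict(zip(addresses, args))}
        return (body, new_env, new_store, kont)
    if isinstance(proc, Cont):
        (value,) = args                               # continuations take one value
        return applykont(proc.kont, value, store)     # the current kont is discarded
    raise TypeError("not a procedure")


add = Closure(("λ", ("x", "y"), ("+", "x", "y")), {})
next_state = applyproc(add, [1, 2], {}, "halt")
print(next_state)
```

Note that the new environment extends the closure's captured environment, not the caller's, which is what gives the language lexical scope.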
Applying continuations
The auxiliary function applykont:Kont×Value×Store⇀Σ applies a continuation to a return value:
applykont(letk(v,body,ρ,κ),value,σ) = (body,ρ[v↦a],σ[a↦value],κ)
where a∉dom(σ) is a fresh address.
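In Python, applying a let continuation could be sketched as below. The `Answer` record returned when halt receives a value is an assumption made so that the final result has somewhere to go; integer addresses are likewise assumed.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class LetK:
    var: str
    body: Any
    env: dict
    kont: Any


@dataclass(frozen=True)
class Answer:
    value: Any   # assumption: terminal record for the program's result


HALT = "halt"


def applykont(kont, value, store):
    if isinstance(kont, LetK):
        address = max(store, default=-1) + 1          # a fresh address, not in dom(store)
        new_env = {**kont.env, kont.var: address}     # bind the variable to it
        new_store = {**store, address: value}         # and store the returned value
        return (kont.body, new_env, new_store, kont.kont)
    return Answer(value)                              # halt: nothing left to resume


waiting = LetK("r", ("*", "n", "r"), {"n": 0}, HALT)
resumed = applykont(waiting, 24, {0: 5})
print(resumed)
```

Execution resumes in the environment saved inside the continuation, so bindings made during the sub-computation do not leak into the body.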
As running code
I've transliterated the math here directly into working Racket code for a CESK interpreter.
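For readers without Racket at hand, a compact Python sketch of the same machine might look like the following. It is not the author's code. Its assumptions: expressions are nested tuples with let written as `("let", var, exp, body)`, addresses are integers, primitives are binary, only `#f` counts as false, void is Python's `None`, and a hypothetical `Answer` record marks termination.

```python
import operator
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Closure:      # a lambda term paired with its environment
    lam: Any
    env: dict

@dataclass(frozen=True)
class Cont:         # a first-class continuation
    kont: Any

@dataclass(frozen=True)
class LetK:         # a let frame awaiting a result
    var: str
    body: Any
    env: dict
    kont: Any

@dataclass(frozen=True)
class Answer:       # assumption: produced when halt receives a value
    value: Any

HALT, VOID = "halt", None
PRIMS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "=": operator.eq}

def is_atomic(e):
    if not isinstance(e, tuple):
        return True                                  # variable, boolean or integer
    return e[0] == "λ" or (isinstance(e[0], str) and e[0] in PRIMS)

def A(e, env, store):
    if isinstance(e, (bool, int)):
        return e
    if isinstance(e, str):
        return store[env[e]]
    if e[0] == "λ":
        return Closure(e, env)
    return PRIMS[e[0]](*(A(a, env, store) for a in e[1:]))

def bind(names, values, env, store):
    """Allocate fresh addresses, then extend env and store functionally."""
    start = max(store, default=-1) + 1
    addrs = range(start, start + len(names))
    return {**env, **dict(zip(names, addrs))}, {**store, **dict(zip(addrs, values))}

def applykont(kont, value, store):
    if not isinstance(kont, LetK):
        return Answer(value)
    env, store = bind([kont.var], [value], kont.env, store)
    return (kont.body, env, store, kont.kont)

def applyproc(proc, args, store, kont):
    if isinstance(proc, Cont):                       # resume the captured continuation
        return applykont(proc.kont, args[0], store)
    _, params, body = proc.lam
    env, store = bind(params, args, proc.env, store)
    return (body, env, store, kont)

def step(state):
    e, env, store, kont = state
    if is_atomic(e):                                 # return
        return applykont(kont, A(e, env, store), store)
    head = e[0]
    if head == "if":
        return (e[2] if A(e[1], env, store) is not False else e[3], env, store, kont)
    if head == "let":
        return (e[2], env, store, LetK(e[1], e[3], env, kont))
    if head == "set!":
        return applykont(kont, VOID, {**store, env[e[1]]: A(e[2], env, store)})
    if head == "letrec":
        names = [v for v, _ in e[1]]
        env, reserved = bind(names, [VOID] * len(names), env, store)
        values = [A(a, env, store) for _, a in e[1]]
        return (e[2], env, {**reserved, **{env[v]: x for v, x in zip(names, values)}}, kont)
    if head == "call/cc":
        return applyproc(A(e[1], env, store), [Cont(kont)], store, kont)
    f, *args = [A(a, env, store) for a in e]         # procedure call
    return applyproc(f, args, store, kont)

def run(program):
    state = (program, {}, {}, HALT)
    while not isinstance(state, Answer):
        state = step(state)
    return state.value

factorial = ("letrec", (("fact", ("λ", ("n",),
                ("if", ("=", "n", 0), 1,
                       ("let", "r", ("fact", ("-", "n", 1)), ("*", "n", "r"))))),),
             ("fact", 5))
escape = ("call/cc", ("λ", ("k",), ("let", "ignored", ("k", 42), 0)))
print(run(factorial), run(escape))
```

The host language contributes only dictionaries and a loop: recursion, mutation and first-class continuations all come from the machine itself.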
Further reading
There are a few good books on implementing compilers and interpreters for functional languages.
The classic MIT text, Structure and Interpretation of Computer Programs, is worth the read.
Lisp in Small Pieces is a consistent recommendation in the courses I teach.
For advanced techniques, Appel's Compiling with Continuations remains my favorite reference.