> Hi folks,
> I'd appreciate any pointers/references on the subject of generating
> human readable source from an AST. ...
> One issue which I've already resolved to my satisfaction is the
> removal of superfluous parenthesis in generated expression code. I've
> done this by carrying around "the binding context" (for lack of a
> better term) during generation. This context is a measure of how
> strongly binding the surrounding operator is; only if the current
> operator is less binding (i.e. has a lower precedence) are parenthesis
> written around the generated code. This scheme also supports handing
> binary operator associativity by slightly decreasing the "binding
> context" for left or right sub-expressions as appropriate.
> Some outstanding issues I'm trying to resolve include:
> 1) How to split long lines especially for long expressions.
> 2) Adding extra parenthesis even if they are not strictly
> necessary from a grammatical point of view but might
> improve the readability.
> 3) Neatly abstracting "layout preferences" from the AST
> walking code.
I have prettyprinters for several languages going back a long way.
They parse and prettyprint the entire language, including expressions,
designators, etc. The general idea is like this:
For each syntactic construct, the formatting routine tries to display
it "horizontally", i.e. all on one line, given a starting position on
the current line. If it won't fit, then it is displayed "vertically",
i.e. on multiple lines. Each construct has its own rule for vertical
formatting. Either way, the nested constructs are formatted in their
The kicker is that the starting position has to come from the parent
construct, and the parent formatting routine doesn't know this yet,
until it has seen how much space the children constructs will require.
In my old prettyprinters, there is no AST built. It is done on the
fly, using recursive descent parsing and formatting. So each routine
tentatively formats its construct vertically, passing down start
positions to child constructs as determined by vertical formatting.
The childen routines do the same, but also return the number of
characters their construct whould take if formatted horizontally.
The parent must, of course, compute this same number for its own
construct, to return to its caller. But it uses it to make a late
decision to change to horizontal formatting.
A consequence of this approach is that if a construct is horizontal,
then so are all its descendents. This precludes "L-formatting", e.g.:
CallProcedureWithALongName ( ThisIsTheFirstActualParameter
Sometimes not using L-formatting takes more lines.
The details of representing the speculatively-formatted stuff get a
bit messy. If you felt you could just let the entire program
accumulate in in-memory data structure at once, it would be easy, but
I didn't think I could afford that at the time. With today's
real/virtual memory sizes, it might be different.
There are a number of additional problems. There is also a "filled"
formatting style for lists of things (e.g. enumeration literals in an
enumeration type). Each line is filled with list items like filling a
line with words in a word processor, except that an item is not
necessarily a single word, e.g. an actual parameter. Occasionally, I
have rules to fill certain pieces of a construct which are not a
variable length list, but fixed syntax.
To this day, whether formulating rules for automated prettyprinting or
manually formatting code, I still can't decide whether I like the
filled format. All vertical seems pretty extravagant, but filled
seems pretty hard to read for things like parameter lists.
I sometimes have a few constructs which are always vertical, e.g. a
module/package/compilation unit, a procedure body, etc.
Comments and blank lines are another complication. I have some
convoluted rules that distinguish different kinds of comments by the
way they appear in the input. They do a passible job some of the
time, not what I really want other times.
Recently, I have done a little of this from an AST. This is a lot
easier to implement, because the node attribute
DisplayedLengthAssumingHorizontal can be first computed bottom up,
then the formatting done in the obvious way, emitting output directly.
There is no need to do speculate-then-change-your-mind formatting,
with its messy data structure.
The method you mentioned for inserting parentheses just where they are
needed to preserve evaluation order works fine. You can also have a
one-child AST node which means "Enclose in parentheses whether they
are needed or not" to preserve places where input code had them (or,
you decided they would improve readability).
Gnat uses a different method. Instead of introducing extra AST nodes
for added parentheses, every expression node has a field which is the
number of pairs of parentheses around this subexpression. However, it
is packed into two bits, so more than three pairs will get lost. I
agree with the designers that this is not a big loss.
I have experimented with designing a "format syntax" notation. Basic
format syntax gives, for each AST node, the delimiters which need to
be inserted before/between/after the formattings of the children of
the node. By itself, it only tells how to construct a token stream.
You can add LineBreak symbols (with relative indentations) to a format
syntax rule to show the vertical formatting. A propery of the rule
can tell if it is always vertical, horizontal if possible, or filled.
You can also allow bracketing of substrings of the token stream
defined by the rule and let these have different formatting
I have designed such a metalanguage, but I did not implement it as
such, as these have all been hurry-up jobs. Instead, I wrote a few
procedures with very short names corresponding to the formatting
symbols. Hard coding of the formatting for a particular AST node, with
calls on these routines, looks something like the rule would look, if
written in the format syntax.
Two political problems:
1) Programmmers often have extremely strong feelings about formatting
(myself included), and
2) It would be rare indeed to find any two whose formatting
All the better reason to abstract the formatting rules.